'Living' Conclusion Gathering Space¶

Conclusions Data Merge¶

  • There seems to be no obvious pattern in the null values of the target column, which suggests we can safely drop those rows

County

  • We think county 12 is anomalous: its location is unknown (it is literally named 'unknown') and it introduces NA values because it is not covered by the weather data
  • Dropping county 12 is an option, but this may cause other problems
  • According to the competition forums, counties 16 and 0 may also be anomalous

data_block_id

  • We could reduce the number of NaN and NaT values by excluding data_block_id 0 and 1 (the very beginning of the data set)
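As a minimal sketch of this filtering step (on a toy frame; the column names mirror our merged data, and the values are made up):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the merged frame; in the real data, blocks 0 and 1
# carry most of the NaN values because client data starts later.
merged = pd.DataFrame({
    "data_block_id": [0, 0, 1, 2, 3],
    "eic_count_client": [np.nan, np.nan, np.nan, 108.0, 17.0],
})

# Keep only rows outside the first two data blocks.
filtered = merged[~merged["data_block_id"].isin([0, 1])].reset_index(drop=True)
```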

Modelling / Time Series

  • We are unsure about the modelling approach (is a dedicated time-series model such as ARMA needed?); the competition forums suggest gradient boosting, e.g. XGBoost

Conclusions EDA¶

  • Consumption is noticeably affected by the winter holidays
  • Consumption appears to grow over time
  • County 0 dominates; Tallinn is located there
  • The temperature-to-production ratio appears to have changed in the last year
  • product_type 2 attracts producers with small installed capacity and thus low production
  • product_type 3 attracts the opposite cluster: producers with large installed_capacity
  • Surface solar radiation seems to correlate more strongly with the target than direct solar radiation
  • There seems to be a 'split' around a daily mean target of 6000 (unit?)
  • We expected more businesses among the top producers

Preparing Data¶

In [106]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
In [107]:
client = pd.read_csv('../data/client.csv')
client.head()
Out[107]:
product_type county eic_count installed_capacity is_business date data_block_id
0 1 0 108 952.89 0 2021-09-01 2
1 2 0 17 166.40 0 2021-09-01 2
2 3 0 688 7207.88 0 2021-09-01 2
3 0 0 5 400.00 1 2021-09-01 2
4 1 0 43 1411.00 1 2021-09-01 2

Client Data¶

In [108]:
# Datatype conversion
client.date = pd.to_datetime(client.date)
client.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41919 entries, 0 to 41918
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   product_type        41919 non-null  int64         
 1   county              41919 non-null  int64         
 2   eic_count           41919 non-null  int64         
 3   installed_capacity  41919 non-null  float64       
 4   is_business         41919 non-null  int64         
 5   date                41919 non-null  datetime64[ns]
 6   data_block_id       41919 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(5)
memory usage: 2.2 MB
In [109]:
# categoricals?
display(client.product_type.unique())
display(client.is_business.unique())
display(client.county.unique())
array([1, 2, 3, 0])
array([0, 1])
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
In [110]:
sns.scatterplot(x = client.date, y = client.data_block_id)
Out[110]:
<Axes: xlabel='date', ylabel='data_block_id'>
(Figure: scatter plot of data_block_id over date.)
In [111]:
display(min(client.date))
display(max(client.date))
display(client.data_block_id.nunique())
Timestamp('2021-09-01 00:00:00')
Timestamp('2023-05-29 00:00:00')
636
First day is '2021-09-01 00:00:00', last day is '2023-05-29 00:00:00'. There are 636 unique days, and each data_block_id corresponds to exactly one date. 
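The claimed one-to-one correspondence between data_block_id and date can be verified programmatically. A sketch on a toy frame (on the real data, substitute the client DataFrame):

```python
import pandas as pd

# Toy client frame: two rows share a block, and the mapping is still 1:1.
client_toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-01", "2021-09-01", "2021-09-02"]),
    "data_block_id": [2, 2, 3],
})

# Each data_block_id should map to exactly one date, and vice versa.
dates_per_block = client_toy.groupby("data_block_id")["date"].nunique()
blocks_per_date = client_toy.groupby("date")["data_block_id"].nunique()
is_one_to_one = bool((dates_per_block == 1).all() and (blocks_per_date == 1).all())
```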

Electricity Prices Data¶

In [112]:
electricity_prices = pd.read_csv('../data/electricity_prices.csv')
electricity_prices.head()
Out[112]:
forecast_date euros_per_mwh origin_date data_block_id
0 2021-09-01 00:00:00 92.51 2021-08-31 00:00:00 1
1 2021-09-01 01:00:00 88.90 2021-08-31 01:00:00 1
2 2021-09-01 02:00:00 87.35 2021-08-31 02:00:00 1
3 2021-09-01 03:00:00 86.88 2021-08-31 03:00:00 1
4 2021-09-01 04:00:00 88.43 2021-08-31 04:00:00 1
In [113]:
electricity_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15286 entries, 0 to 15285
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   forecast_date  15286 non-null  object 
 1   euros_per_mwh  15286 non-null  float64
 2   origin_date    15286 non-null  object 
 3   data_block_id  15286 non-null  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 477.8+ KB
In [114]:
electricity_prices.forecast_date = pd.to_datetime(electricity_prices.forecast_date)
electricity_prices.origin_date = pd.to_datetime(electricity_prices.origin_date)

electricity_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15286 entries, 0 to 15285
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   forecast_date  15286 non-null  datetime64[ns]
 1   euros_per_mwh  15286 non-null  float64       
 2   origin_date    15286 non-null  datetime64[ns]
 3   data_block_id  15286 non-null  int64         
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 477.8 KB
In [115]:
electricity_prices.nunique()
Out[115]:
forecast_date    15286
euros_per_mwh    11025
origin_date      15286
data_block_id      637
dtype: int64
In [116]:
electricity_prices.forecast_date.unique()
Out[116]:
<DatetimeArray>
['2021-09-01 00:00:00', '2021-09-01 01:00:00', '2021-09-01 02:00:00',
 '2021-09-01 03:00:00', '2021-09-01 04:00:00', '2021-09-01 05:00:00',
 '2021-09-01 06:00:00', '2021-09-01 07:00:00', '2021-09-01 08:00:00',
 '2021-09-01 09:00:00',
 ...
 '2023-05-30 14:00:00', '2023-05-30 15:00:00', '2023-05-30 16:00:00',
 '2023-05-30 17:00:00', '2023-05-30 18:00:00', '2023-05-30 19:00:00',
 '2023-05-30 20:00:00', '2023-05-30 21:00:00', '2023-05-30 22:00:00',
 '2023-05-30 23:00:00']
Length: 15286, dtype: datetime64[ns]
Electricity price forecasts are available for each hour of the day, covering 637 days (one day more than the client data). Note that 637 days × 24 hours = 15,288, yet there are only 15,286 unique datetimes, so two hours are missing somewhere.
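The missing hours could be located by comparing against a complete hourly range. A sketch on a toy series with one hour deliberately removed (the same idea would apply to the real forecast_date column):

```python
import pandas as pd

# Toy hourly index with 02:00 deliberately missing.
observed = pd.to_datetime([
    "2021-09-01 00:00",
    "2021-09-01 01:00",
    "2021-09-01 03:00",
])

# Build the full hourly range and take the set difference.
full_range = pd.date_range(observed.min(), observed.max(), freq="h")
missing = full_range.difference(observed)
```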

Forecast Weather Data¶

In [117]:
forecast_weather = pd.read_csv('../data/forecast_weather.csv')

forecast_weather.head()
Out[117]:
latitude longitude origin_datetime hours_ahead temperature dewpoint cloudcover_high cloudcover_low cloudcover_mid cloudcover_total 10_metre_u_wind_component 10_metre_v_wind_component data_block_id forecast_datetime direct_solar_radiation surface_solar_radiation_downwards snowfall total_precipitation
0 57.6 21.7 2021-09-01 00:00:00+00:00 1 15.655786 11.553613 0.904816 0.019714 0.000000 0.905899 -0.411328 -9.106137 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
1 57.6 22.2 2021-09-01 00:00:00+00:00 1 13.003931 10.689844 0.886322 0.004456 0.000000 0.886658 0.206347 -5.355405 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
2 57.6 22.7 2021-09-01 00:00:00+00:00 1 14.206567 11.671777 0.729034 0.005615 0.000000 0.730499 1.451587 -7.417905 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
3 57.6 23.2 2021-09-01 00:00:00+00:00 1 14.844507 12.264917 0.336304 0.074341 0.000626 0.385468 1.090869 -9.163999 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
4 57.6 23.7 2021-09-01 00:00:00+00:00 1 15.293848 12.458887 0.102875 0.088074 0.000015 0.176590 1.268481 -8.975766 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
In [118]:
forecast_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424512 entries, 0 to 3424511
Data columns (total 18 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   latitude                           float64
 1   longitude                          float64
 2   origin_datetime                    object 
 3   hours_ahead                        int64  
 4   temperature                        float64
 5   dewpoint                           float64
 6   cloudcover_high                    float64
 7   cloudcover_low                     float64
 8   cloudcover_mid                     float64
 9   cloudcover_total                   float64
 10  10_metre_u_wind_component          float64
 11  10_metre_v_wind_component          float64
 12  data_block_id                      int64  
 13  forecast_datetime                  object 
 14  direct_solar_radiation             float64
 15  surface_solar_radiation_downwards  float64
 16  snowfall                           float64
 17  total_precipitation                float64
dtypes: float64(14), int64(2), object(2)
memory usage: 470.3+ MB
In [119]:
forecast_weather.origin_datetime = pd.to_datetime(forecast_weather.origin_datetime)
forecast_weather.forecast_datetime = pd.to_datetime(forecast_weather.forecast_datetime)

forecast_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424512 entries, 0 to 3424511
Data columns (total 18 columns):
 #   Column                             Dtype              
---  ------                             -----              
 0   latitude                           float64            
 1   longitude                          float64            
 2   origin_datetime                    datetime64[ns, UTC]
 3   hours_ahead                        int64              
 4   temperature                        float64            
 5   dewpoint                           float64            
 6   cloudcover_high                    float64            
 7   cloudcover_low                     float64            
 8   cloudcover_mid                     float64            
 9   cloudcover_total                   float64            
 10  10_metre_u_wind_component          float64            
 11  10_metre_v_wind_component          float64            
 12  data_block_id                      int64              
 13  forecast_datetime                  datetime64[ns, UTC]
 14  direct_solar_radiation             float64            
 15  surface_solar_radiation_downwards  float64            
 16  snowfall                           float64            
 17  total_precipitation                float64            
dtypes: datetime64[ns, UTC](2), float64(14), int64(2)
memory usage: 470.3 MB
In [120]:
forecast_weather.nunique()
Out[120]:
latitude                                   8
longitude                                 14
origin_datetime                          637
hours_ahead                               48
temperature                           400895
dewpoint                              439461
cloudcover_high                       212178
cloudcover_low                        408609
cloudcover_mid                        308004
cloudcover_total                      698385
10_metre_u_wind_component            3166997
10_metre_v_wind_component            3142161
data_block_id                            637
forecast_datetime                      15312
direct_solar_radiation               1199360
surface_solar_radiation_downwards    1496773
snowfall                              256201
total_precipitation                   540661
dtype: int64
In [121]:
forecast_weather.groupby(['latitude', 'longitude']).nunique()
Out[121]:
origin_datetime hours_ahead temperature dewpoint cloudcover_high cloudcover_low cloudcover_mid cloudcover_total 10_metre_u_wind_component 10_metre_v_wind_component data_block_id forecast_datetime direct_solar_radiation surface_solar_radiation_downwards snowfall total_precipitation
latitude longitude
57.6 21.7 637 48 27706 28047 10902 14794 13125 16896 30563 30551 637 15312 14747 16068 4250 9540
22.2 637 48 28285 28292 11118 15006 13505 17130 30548 30550 637 15312 14555 16085 4389 9353
22.7 637 48 27799 28120 11000 14869 13640 17178 30554 30556 637 15312 14561 16074 4327 9261
23.2 637 48 27587 28120 10821 14289 13280 16737 30558 30556 637 15312 14670 16092 4415 9384
23.7 637 48 27697 28045 11010 14380 13396 16817 30561 30560 637 15312 14648 16084 4462 9540
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59.7 26.2 637 48 27959 28170 11356 14511 13565 16080 30561 30559 637 15312 14396 16066 5199 9652
26.7 637 48 28019 28199 11082 14412 13420 15878 30563 30560 637 15312 14375 16072 5204 9640
27.2 637 48 28033 28129 11387 14400 13519 15858 30551 30563 637 15312 14507 16078 5318 9743
27.7 637 48 28156 28283 10929 14201 13348 15712 30555 30556 637 15312 14401 16101 5466 9935
28.2 637 48 28495 28373 11102 14822 13779 16164 30542 30553 637 15312 14325 16128 5729 10107

112 rows × 16 columns

There are 112 unique combinations of latitude and longitude (unique weather stations). 
So for each forecast_datetime, there are 112 observations (one from each station). 
In [122]:
display(forecast_weather.hours_ahead.nunique())
display(forecast_weather.origin_datetime.nunique())
display(forecast_weather.groupby('origin_datetime').forecast_datetime.size()/112)
48
637
origin_datetime
2021-09-01 00:00:00+00:00    48.0
2021-09-02 00:00:00+00:00    48.0
2021-09-03 00:00:00+00:00    48.0
2021-09-04 00:00:00+00:00    48.0
2021-09-05 00:00:00+00:00    48.0
                             ... 
2023-05-26 00:00:00+00:00    48.0
2023-05-27 00:00:00+00:00    48.0
2023-05-28 00:00:00+00:00    48.0
2023-05-29 00:00:00+00:00    48.0
2023-05-30 00:00:00+00:00    48.0
Name: forecast_datetime, Length: 637, dtype: float64
In [123]:
forecast_weather.groupby(['origin_datetime', 'forecast_datetime']).count()
Out[123]:
latitude longitude hours_ahead temperature dewpoint cloudcover_high cloudcover_low cloudcover_mid cloudcover_total 10_metre_u_wind_component 10_metre_v_wind_component data_block_id direct_solar_radiation surface_solar_radiation_downwards snowfall total_precipitation
origin_datetime forecast_datetime
2021-09-01 00:00:00+00:00 2021-09-01 01:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2021-09-01 02:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2021-09-01 03:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2021-09-01 04:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2021-09-01 05:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2023-05-30 00:00:00+00:00 2023-05-31 20:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2023-05-31 21:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2023-05-31 22:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2023-05-31 23:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
2023-06-01 00:00:00+00:00 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112

30576 rows × 16 columns

Gas Prices Data¶

In [124]:
gas_prices = pd.read_csv('../data/gas_prices.csv')

gas_prices.head()
Out[124]:
forecast_date lowest_price_per_mwh highest_price_per_mwh origin_date data_block_id
0 2021-09-01 45.23 46.32 2021-08-31 1
1 2021-09-02 45.62 46.29 2021-09-01 2
2 2021-09-03 45.85 46.40 2021-09-02 3
3 2021-09-04 46.30 46.80 2021-09-03 4
4 2021-09-05 46.30 46.58 2021-09-04 5
In [125]:
gas_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   forecast_date          637 non-null    object 
 1   lowest_price_per_mwh   637 non-null    float64
 2   highest_price_per_mwh  637 non-null    float64
 3   origin_date            637 non-null    object 
 4   data_block_id          637 non-null    int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 25.0+ KB
In [126]:
gas_prices.forecast_date = pd.to_datetime(gas_prices.forecast_date)
gas_prices.origin_date = pd.to_datetime(gas_prices.origin_date)

gas_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   forecast_date          637 non-null    datetime64[ns]
 1   lowest_price_per_mwh   637 non-null    float64       
 2   highest_price_per_mwh  637 non-null    float64       
 3   origin_date            637 non-null    datetime64[ns]
 4   data_block_id          637 non-null    int64         
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 25.0 KB
In [127]:
gas_prices.describe()
Out[127]:
forecast_date lowest_price_per_mwh highest_price_per_mwh origin_date data_block_id
count 637 637.000000 637.000000 637 637.000000
mean 2022-07-16 00:00:00 95.036750 107.754631 2022-07-15 00:00:00 319.000000
min 2021-09-01 00:00:00 28.100000 34.000000 2021-08-31 00:00:00 1.000000
25% 2022-02-07 00:00:00 60.000000 67.530000 2022-02-06 00:00:00 160.000000
50% 2022-07-16 00:00:00 85.210000 93.470000 2022-07-15 00:00:00 319.000000
75% 2022-12-22 00:00:00 109.000000 130.740000 2022-12-21 00:00:00 478.000000
max 2023-05-30 00:00:00 250.000000 305.000000 2023-05-29 00:00:00 637.000000
std NaN 47.552295 54.743666 NaN 184.030342

Historical Weather Data¶

In [128]:
historical_weather = pd.read_csv('../data/historical_weather.csv')

historical_weather.head()
Out[128]:
datetime temperature dewpoint rain snowfall surface_pressure cloudcover_total cloudcover_low cloudcover_mid cloudcover_high windspeed_10m winddirection_10m shortwave_radiation direct_solar_radiation diffuse_radiation latitude longitude data_block_id
0 2021-09-01 00:00:00 14.4 12.0 0.0 0.0 1015.8 4 4 0 0 6.694444 3 0.0 0.0 0.0 57.6 21.7 1
1 2021-09-01 00:00:00 14.0 12.0 0.0 0.0 1010.6 7 8 0 0 4.944444 353 0.0 0.0 0.0 57.6 22.2 1
2 2021-09-01 00:00:00 14.4 12.8 0.0 0.0 1014.9 6 7 0 0 5.833333 348 0.0 0.0 0.0 57.6 22.7 1
3 2021-09-01 00:00:00 15.4 13.0 0.0 0.0 1014.4 4 2 4 0 7.111111 349 0.0 0.0 0.0 57.6 23.2 1
4 2021-09-01 00:00:00 15.9 12.6 0.0 0.0 1013.8 12 7 0 20 8.388889 360 0.0 0.0 0.0 57.6 23.7 1
In [129]:
historical_weather.datetime = pd.to_datetime(historical_weather.datetime)

historical_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1710800 entries, 0 to 1710799
Data columns (total 18 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   datetime                datetime64[ns]
 1   temperature             float64       
 2   dewpoint                float64       
 3   rain                    float64       
 4   snowfall                float64       
 5   surface_pressure        float64       
 6   cloudcover_total        int64         
 7   cloudcover_low          int64         
 8   cloudcover_mid          int64         
 9   cloudcover_high         int64         
 10  windspeed_10m           float64       
 11  winddirection_10m       int64         
 12  shortwave_radiation     float64       
 13  direct_solar_radiation  float64       
 14  diffuse_radiation       float64       
 15  latitude                float64       
 16  longitude               float64       
 17  data_block_id           int64         
dtypes: datetime64[ns](1), float64(11), int64(6)
memory usage: 234.9 MB
In [130]:
historical_weather.isnull().sum()
Out[130]:
datetime                  0
temperature               0
dewpoint                  0
rain                      0
snowfall                  0
surface_pressure          0
cloudcover_total          0
cloudcover_low            0
cloudcover_mid            0
cloudcover_high           0
windspeed_10m             0
winddirection_10m         0
shortwave_radiation       0
direct_solar_radiation    0
diffuse_radiation         0
latitude                  0
longitude                 0
data_block_id             0
dtype: int64

Train Data & Checking for NULL values¶

In [131]:
train = pd.read_csv('../data/train.csv')

train.head()
Out[131]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
0 0 0 1 0.713 0 2021-09-01 00:00:00 0 0 0
1 0 0 1 96.590 1 2021-09-01 00:00:00 0 1 0
2 0 0 2 0.000 0 2021-09-01 00:00:00 0 2 1
3 0 0 2 17.314 1 2021-09-01 00:00:00 0 3 1
4 0 0 3 2.904 0 2021-09-01 00:00:00 0 4 2
In [132]:
train.datetime = pd.to_datetime(train.datetime, format='%Y-%m-%d %H:%M:%S')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2018352 entries, 0 to 2018351
Data columns (total 9 columns):
 #   Column              Dtype         
---  ------              -----         
 0   county              int64         
 1   is_business         int64         
 2   product_type        int64         
 3   target              float64       
 4   is_consumption      int64         
 5   datetime            datetime64[ns]
 6   data_block_id       int64         
 7   row_id              int64         
 8   prediction_unit_id  int64         
dtypes: datetime64[ns](1), float64(1), int64(7)
memory usage: 138.6 MB
In [133]:
train.head()
Out[133]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
0 0 0 1 0.713 0 2021-09-01 0 0 0
1 0 0 1 96.590 1 2021-09-01 0 1 0
2 0 0 2 0.000 0 2021-09-01 0 2 1
3 0 0 2 17.314 1 2021-09-01 0 3 1
4 0 0 3 2.904 0 2021-09-01 0 4 2
In [134]:
train.datetime[1000]
Out[134]:
Timestamp('2021-09-01 08:00:00')
In [135]:
train.isnull().sum()
Out[135]:
county                  0
is_business             0
product_type            0
target                528
is_consumption          0
datetime                0
data_block_id           0
row_id                  0
prediction_unit_id      0
dtype: int64
In [136]:
train.groupby('is_consumption').agg({'target': lambda x: x.isnull().sum()})
Out[136]:
target
is_consumption
0 264
1 264
In [137]:
train[train.target.isnull()]
Out[137]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
178938 0 0 1 NaN 0 2021-10-31 03:00:00 60 178938 0
178939 0 0 1 NaN 1 2021-10-31 03:00:00 60 178939 0
178940 0 0 2 NaN 0 2021-10-31 03:00:00 60 178940 1
178941 0 0 2 NaN 1 2021-10-31 03:00:00 60 178941 1
178942 0 0 3 NaN 0 2021-10-31 03:00:00 60 178942 2
... ... ... ... ... ... ... ... ... ...
1806379 15 1 0 NaN 1 2023-03-26 03:00:00 571 1806379 64
1806380 15 1 1 NaN 0 2023-03-26 03:00:00 571 1806380 59
1806381 15 1 1 NaN 1 2023-03-26 03:00:00 571 1806381 59
1806382 15 1 3 NaN 0 2023-03-26 03:00:00 571 1806382 60
1806383 15 1 3 NaN 1 2023-03-26 03:00:00 571 1806383 60

528 rows × 9 columns

The null targets cluster at a few specific timestamps across all prediction units (e.g. 2021-10-31 03:00 and 2023-03-26 03:00, which look like daylight-saving transitions) rather than at particular units, so dropping these rows seems safe.
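A hedged sketch of the row-dropping step on a toy frame (on the real data this would simply be train.dropna(subset=['target'])):

```python
import pandas as pd
import numpy as np

# Toy train frame with two null targets.
train_toy = pd.DataFrame({
    "target": [0.713, np.nan, 17.314, np.nan],
    "is_consumption": [0, 1, 1, 0],
})

# Drop rows where the target is missing.
train_clean = train_toy.dropna(subset=["target"]).reset_index(drop=True)
```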
In [138]:
weather_station_to_county_mapping = pd.read_csv('../data/weather_station_to_county_mapping.csv')

weather_station_to_county_mapping.head()
Out[138]:
county_name longitude latitude county
0 NaN 21.7 57.6 NaN
1 NaN 21.7 57.9 NaN
2 NaN 21.7 58.2 NaN
3 NaN 21.7 58.5 NaN
4 NaN 21.7 58.8 NaN
In [139]:
weather_station_to_county_mapping.county.unique()
Out[139]:
array([nan, 10.,  1.,  7.,  6.,  9.,  0., 14.,  3., 13., 11.,  4.,  5.,
       15.,  8.,  2.])
In [140]:
weather_station_to_county_mapping.county_name.unique()
Out[140]:
array([nan, 'Saaremaa', 'Hiiumaa', 'Pärnumaa', 'Läänemaa', 'Raplamaa',
       'Harjumaa', 'Viljandimaa', 'Järvamaa', 'Valgamaa', 'Tartumaa',
       'Jõgevamaa', 'Lääne-Virumaa', 'Võrumaa', 'Põlvamaa', 'Ida-Virumaa'],
      dtype=object)

See https://www.kaggle.com/code/fabiendaniel/mapping-locations-and-county-codes/notebook for the county codes. There, the 'maa' suffix is removed from the county names — but is this really necessary?
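If we did want to strip the suffix (probably only needed when matching against an external source that omits it), str.removesuffix would do it:

```python
# Strip the Estonian 'maa' suffix from county names (Python 3.9+).
names = ["Saaremaa", "Harjumaa", "Ida-Virumaa"]
stripped = [name.removesuffix("maa") for name in names]
# stripped == ['Saare', 'Harju', 'Ida-Viru']
```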

Data Merging (now we merge everything to train)¶

In [141]:
train.head()
Out[141]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
0 0 0 1 0.713 0 2021-09-01 0 0 0
1 0 0 1 96.590 1 2021-09-01 0 1 0
2 0 0 2 0.000 0 2021-09-01 0 2 1
3 0 0 2 17.314 1 2021-09-01 0 3 1
4 0 0 3 2.904 0 2021-09-01 0 4 2

Merge Client¶

In [142]:
client.head()
Out[142]:
product_type county eic_count installed_capacity is_business date data_block_id
0 1 0 108 952.89 0 2021-09-01 2
1 2 0 17 166.40 0 2021-09-01 2
2 3 0 688 7207.88 0 2021-09-01 2
3 0 0 5 400.00 1 2021-09-01 2
4 1 0 43 1411.00 1 2021-09-01 2
In [143]:
print(client.size, train.size)
293433 18165168
In [144]:
# append '_client' to merged columns
client.columns = [f"{column}_client" if column not in ['data_block_id', 'county', 'is_business', 'product_type'] else column for column in client.columns]
In [145]:
# merge train and client

merged_df = pd.merge(train, client, on=['data_block_id', 'county', 'is_business', 'product_type'], how='left')

merged_df.head()
Out[145]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client installed_capacity_client date_client
0 0 0 1 0.713 0 2021-09-01 0 0 0 NaN NaN NaT
1 0 0 1 96.590 1 2021-09-01 0 1 0 NaN NaN NaT
2 0 0 2 0.000 0 2021-09-01 0 2 1 NaN NaN NaT
3 0 0 2 17.314 1 2021-09-01 0 3 1 NaN NaN NaT
4 0 0 3 2.904 0 2021-09-01 0 4 2 NaN NaN NaT

There are quite a few null values after the merge, especially at the beginning and end of the period.

In [146]:
# how many eic counts per data_block_id?
merged_df[merged_df.eic_count_client.isnull()].data_block_id.value_counts()
Out[146]:
data_block_id
0      2928
1      2928
30      144
31      144
122     144
123     144
154      96
607      96
153      96
584      48
540      48
533      48
541      48
572      48
573      48
580      48
583      48
596      48
590      48
591      48
595      48
519      48
602      48
603      48
606      48
608      48
611      48
532      48
458      48
518      48
514      48
91       48
92       48
172      48
173      48
203      48
204      48
215      48
216      48
273      48
274      48
365      48
366      48
395      48
396      48
426      48
427      48
513      48
612      48
Name: count, dtype: int64
In [147]:
# do dates in train and client overlap? 
print(set(client.date_client.dt.date) ^ set(train.datetime.dt.date))
print(set(train.data_block_id) ^ set(client.data_block_id))
{datetime.date(2023, 5, 30), datetime.date(2023, 5, 31)}
{0, 1}
In [148]:
def print_block(dbid):
    display("TRAIN", train[train['data_block_id'] == dbid])
#     display("FORC WEATHER", forecast_weather[forecast_weather['data_block_id'] == dbid])
    display("CLIENT", client[client['data_block_id'] == dbid])
#     display("HIST WEATHER", historical_weather[historical_weather['data_block_id'] == dbid])
#     display("E PRICES", electricity_prices[electricity_prices['data_block_id'] == dbid])
#     display("G PRICES", gas_prices[gas_prices['data_block_id'] == dbid])

print_block(10)
'TRAIN'
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
29280 0 0 1 0.000 0 2021-09-11 00:00:00 10 29280 0
29281 0 0 1 109.095 1 2021-09-11 00:00:00 10 29281 0
29282 0 0 2 0.000 0 2021-09-11 00:00:00 10 29282 1
29283 0 0 2 17.503 1 2021-09-11 00:00:00 10 29283 1
29284 0 0 3 0.077 0 2021-09-11 00:00:00 10 29284 2
... ... ... ... ... ... ... ... ... ...
32203 15 0 3 76.522 1 2021-09-11 23:00:00 10 32203 58
32204 15 1 1 0.100 0 2021-09-11 23:00:00 10 32204 59
32205 15 1 1 17.543 1 2021-09-11 23:00:00 10 32205 59
32206 15 1 3 0.000 0 2021-09-11 23:00:00 10 32206 60
32207 15 1 3 292.390 1 2021-09-11 23:00:00 10 32207 60

2928 rows × 9 columns

'CLIENT'
product_type county eic_count_client installed_capacity_client is_business date_client data_block_id
488 1 0 108 952.89 0 2021-09-09 10
489 2 0 17 166.40 0 2021-09-09 10
490 3 0 687 7199.88 0 2021-09-09 10
491 0 0 5 400.00 1 2021-09-09 10
492 1 0 43 1411.00 1 2021-09-09 10
... ... ... ... ... ... ... ...
544 3 14 67 3114.60 1 2021-09-09 10
545 1 15 10 83.20 0 2021-09-09 10
546 3 15 61 918.20 0 2021-09-09 10
547 1 15 7 325.00 1 2021-09-09 10
548 3 15 49 1778.70 1 2021-09-09 10

61 rows × 7 columns

There is a problem with NULL values after merging. One source is the start and end dates, but we don't know what is happening in between or whether it is problematic. Perhaps some client data is sporadically missing?
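One way to pin down which rows fail to match is pandas' merge indicator. A sketch on toy frames (the real merge keys are data_block_id, county, is_business, and product_type):

```python
import pandas as pd

# Toy frames: block 3 exists in train but has no client counterpart.
train_toy = pd.DataFrame({
    "data_block_id": [2, 2, 3],
    "county": [0, 0, 0],
})
client_toy = pd.DataFrame({
    "data_block_id": [2],
    "county": [0],
    "eic_count": [108],
})

# indicator=True adds a '_merge' column marking unmatched left rows.
m = pd.merge(train_toy, client_toy, on=["data_block_id", "county"],
             how="left", indicator=True)
unmatched_per_block = m[m["_merge"] == "left_only"].groupby("data_block_id").size()
```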

Merge Gas Prices¶

In [149]:
gas_prices.head()
Out[149]:
forecast_date lowest_price_per_mwh highest_price_per_mwh origin_date data_block_id
0 2021-09-01 45.23 46.32 2021-08-31 1
1 2021-09-02 45.62 46.29 2021-09-01 2
2 2021-09-03 45.85 46.40 2021-09-02 3
3 2021-09-04 46.30 46.80 2021-09-03 4
4 2021-09-05 46.30 46.58 2021-09-04 5
In [150]:
# append _gas_prices to columns
gas_prices.columns = [f"{column}_gas_prices" if column != 'data_block_id' else column for column in gas_prices.columns]
In [151]:
# merge gas_prices

merged_df = pd.merge(merged_df, gas_prices, on=['data_block_id'], how='left')

merged_df.tail()
Out[151]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client installed_capacity_client date_client forecast_date_gas_prices lowest_price_per_mwh_gas_prices highest_price_per_mwh_gas_prices origin_date_gas_prices
2018347 15 1 0 197.233 1 2023-05-31 23:00:00 637 2018347 64 15.0 620.0 2023-05-29 2023-05-30 29.0 34.0 2023-05-29
2018348 15 1 1 0.000 0 2023-05-31 23:00:00 637 2018348 59 20.0 624.5 2023-05-29 2023-05-30 29.0 34.0 2023-05-29
2018349 15 1 1 28.404 1 2023-05-31 23:00:00 637 2018349 59 20.0 624.5 2023-05-29 2023-05-30 29.0 34.0 2023-05-29
2018350 15 1 3 0.000 0 2023-05-31 23:00:00 637 2018350 60 55.0 2188.2 2023-05-29 2023-05-30 29.0 34.0 2023-05-29
2018351 15 1 3 196.240 1 2023-05-31 23:00:00 637 2018351 60 55.0 2188.2 2023-05-29 2023-05-30 29.0 34.0 2023-05-29

Merge Electricity Prices¶

In [152]:
# add time column for merging with electricity data
merged_df['time_of_day'] = merged_df['datetime'].dt.time
merged_df.head()
Out[152]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client installed_capacity_client date_client forecast_date_gas_prices lowest_price_per_mwh_gas_prices highest_price_per_mwh_gas_prices origin_date_gas_prices time_of_day
0 0 0 1 0.713 0 2021-09-01 0 0 0 NaN NaN NaT NaT NaN NaN NaT 00:00:00
1 0 0 1 96.590 1 2021-09-01 0 1 0 NaN NaN NaT NaT NaN NaN NaT 00:00:00
2 0 0 2 0.000 0 2021-09-01 0 2 1 NaN NaN NaT NaT NaN NaN NaT 00:00:00
3 0 0 2 17.314 1 2021-09-01 0 3 1 NaN NaN NaT NaT NaN NaN NaT 00:00:00
4 0 0 3 2.904 0 2021-09-01 0 4 2 NaN NaN NaT NaT NaN NaN NaT 00:00:00
In [153]:
# Merge electricity prices
# the prices are available hourly -> create new column with time 

electricity_prices['time_of_day'] = electricity_prices.forecast_date.dt.time

# append electricity_prices to column names
electricity_prices.columns = [f"{column}_electricity_prices" if column not in ['time_of_day','data_block_id'] else column for column in electricity_prices.columns]


In [154]:
# merge electricity_prices

merged_df = pd.merge(merged_df, electricity_prices, on = ['data_block_id', 'time_of_day'], how='left')

merged_df.head()
Out[154]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client installed_capacity_client date_client forecast_date_gas_prices lowest_price_per_mwh_gas_prices highest_price_per_mwh_gas_prices origin_date_gas_prices time_of_day forecast_date_electricity_prices euros_per_mwh_electricity_prices origin_date_electricity_prices
0 0 0 1 0.713 0 2021-09-01 0 0 0 NaN NaN NaT NaT NaN NaN NaT 00:00:00 NaT NaN NaT
1 0 0 1 96.590 1 2021-09-01 0 1 0 NaN NaN NaT NaT NaN NaN NaT 00:00:00 NaT NaN NaT
2 0 0 2 0.000 0 2021-09-01 0 2 1 NaN NaN NaT NaT NaN NaN NaT 00:00:00 NaT NaN NaT
3 0 0 2 17.314 1 2021-09-01 0 3 1 NaN NaN NaT NaT NaN NaN NaT 00:00:00 NaT NaN NaT
4 0 0 3 2.904 0 2021-09-01 0 4 2 NaN NaN NaT NaT NaN NaN NaT 00:00:00 NaT NaN NaT

Merge Historical Weather¶

In [155]:
# historic weather

historical_weather.head()
Out[155]:
datetime temperature dewpoint rain snowfall surface_pressure cloudcover_total cloudcover_low cloudcover_mid cloudcover_high windspeed_10m winddirection_10m shortwave_radiation direct_solar_radiation diffuse_radiation latitude longitude data_block_id
0 2021-09-01 14.4 12.0 0.0 0.0 1015.8 4 4 0 0 6.694444 3 0.0 0.0 0.0 57.6 21.7 1
1 2021-09-01 14.0 12.0 0.0 0.0 1010.6 7 8 0 0 4.944444 353 0.0 0.0 0.0 57.6 22.2 1
2 2021-09-01 14.4 12.8 0.0 0.0 1014.9 6 7 0 0 5.833333 348 0.0 0.0 0.0 57.6 22.7 1
3 2021-09-01 15.4 13.0 0.0 0.0 1014.4 4 2 4 0 7.111111 349 0.0 0.0 0.0 57.6 23.2 1
4 2021-09-01 15.9 12.6 0.0 0.0 1013.8 12 7 0 20 8.388889 360 0.0 0.0 0.0 57.6 23.7 1
In [156]:
# get county and county_name from weather_station_to_county_mapping (merge on latitude and longitude)

# round latitude and longitude so the merge keys match despite differing precision
historical_weather.latitude = historical_weather.latitude.round(1)
historical_weather.longitude = historical_weather.longitude.round(1)

weather_station_to_county_mapping.latitude = weather_station_to_county_mapping.latitude.round(1)
weather_station_to_county_mapping.longitude = weather_station_to_county_mapping.longitude.round(1)
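Why the rounding matters: merge keys are compared for exact equality, so coordinates stored at different precision never match and the join produces only NaNs. A small sketch with made-up coordinates:

```python
import pandas as pd

# Coordinates stored at different precision fail to join exactly
a = pd.DataFrame({'latitude': [57.63333333], 'temp': [14.4]})
b = pd.DataFrame({'latitude': [57.6], 'county': [3]})

exact = pd.merge(a, b, on='latitude', how='left')
print(exact['county'].isna().all())  # True: 57.63333333 != 57.6

# Rounding both sides to one decimal restores the match
a['latitude'] = a['latitude'].round(1)
merged = pd.merge(a, b, on='latitude', how='left')
print(merged['county'].notna().all())  # True
```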
In [157]:
# merge historical weather to get counties
merged_hist_weather = pd.merge(historical_weather, weather_station_to_county_mapping, on=['latitude', 'longitude'], how='left')
# get time of day
merged_hist_weather['time_of_day'] = merged_hist_weather['datetime'].dt.time

# aggregate by county and time (average over weather stations mapped to the same county)
merged_hist_weather = merged_hist_weather.groupby(['county', 'time_of_day', 'datetime', 'data_block_id']).mean(numeric_only=True).reset_index()

# append _hist_weather to column names
merged_hist_weather.columns = [f"{column}_hist_weather" if column not in ['county', 'time_of_day','data_block_id'] else column for column in merged_hist_weather.columns]


merged_hist_weather.sample()
Out[157]:
county time_of_day datetime_hist_weather data_block_id temperature_hist_weather dewpoint_hist_weather rain_hist_weather snowfall_hist_weather surface_pressure_hist_weather cloudcover_total_hist_weather cloudcover_low_hist_weather cloudcover_mid_hist_weather cloudcover_high_hist_weather windspeed_10m_hist_weather winddirection_10m_hist_weather shortwave_radiation_hist_weather direct_solar_radiation_hist_weather diffuse_radiation_hist_weather latitude_hist_weather longitude_hist_weather
223220 15.0 14:00:00 2022-11-30 14:00:00 457 -2.78 -4.76 0.0 0.056 1021.42 100.0 94.6 83.6 0.4 2.211111 313.4 68.2 1.4 66.8 57.78 27.1
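The suffixing list comprehension above can also be written with `DataFrame.rename` and a dict comprehension, which keeps the join keys untouched without rebuilding the whole column list; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'county': [1, 1], 'temperature': [3.0, 5.0]})

# Suffix every column except the join keys, mirroring the list comprehension above
keys = {'county', 'time_of_day', 'data_block_id'}
df = df.rename(columns={c: f"{c}_hist_weather" for c in df.columns if c not in keys})
print(list(df.columns))  # ['county', 'temperature_hist_weather']
```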
In [158]:
# merge to merged_df
merged_df = pd.merge(merged_df, merged_hist_weather, on=['data_block_id', 'time_of_day', 'county'], how='left')

Merge Forecast Weather¶

In [159]:
forecast_weather.head()
Out[159]:
latitude longitude origin_datetime hours_ahead temperature dewpoint cloudcover_high cloudcover_low cloudcover_mid cloudcover_total 10_metre_u_wind_component 10_metre_v_wind_component data_block_id forecast_datetime direct_solar_radiation surface_solar_radiation_downwards snowfall total_precipitation
0 57.6 21.7 2021-09-01 00:00:00+00:00 1 15.655786 11.553613 0.904816 0.019714 0.000000 0.905899 -0.411328 -9.106137 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
1 57.6 22.2 2021-09-01 00:00:00+00:00 1 13.003931 10.689844 0.886322 0.004456 0.000000 0.886658 0.206347 -5.355405 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
2 57.6 22.7 2021-09-01 00:00:00+00:00 1 14.206567 11.671777 0.729034 0.005615 0.000000 0.730499 1.451587 -7.417905 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
3 57.6 23.2 2021-09-01 00:00:00+00:00 1 14.844507 12.264917 0.336304 0.074341 0.000626 0.385468 1.090869 -9.163999 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
4 57.6 23.7 2021-09-01 00:00:00+00:00 1 15.293848 12.458887 0.102875 0.088074 0.000015 0.176590 1.268481 -8.975766 1 2021-09-01 01:00:00+00:00 0.0 0.0 0.0 0.0
In [160]:
# forecast weather

# round lat and long to match the mapping table's precision
forecast_weather.latitude = forecast_weather.latitude.round(1)
forecast_weather.longitude = forecast_weather.longitude.round(1)

# merge to get counties
merged_forecast_weather = pd.merge(forecast_weather, weather_station_to_county_mapping, on=['latitude', 'longitude'], how='left')

# aggregate over weather stations mapped to the same county
merged_forecast_weather = merged_forecast_weather.groupby(['county', 'forecast_datetime', 'data_block_id']).mean(numeric_only=True).reset_index()

# append _forecast_weather to column names
merged_forecast_weather.columns = [f"{column}_forecast_weather" if column not in ['county', 'forecast_datetime','data_block_id'] else column for column in merged_forecast_weather.columns]


merged_forecast_weather.sample()
Out[160]:
county forecast_datetime data_block_id latitude_forecast_weather longitude_forecast_weather hours_ahead_forecast_weather temperature_forecast_weather dewpoint_forecast_weather cloudcover_high_forecast_weather cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather
337175 11.0 2021-09-19 00:00:00+00:00 18 58.35 26.7 24.0 6.82879 2.614771 0.309937 0.300819 0.0 0.506729 -4.96759 -2.603552 0.0 0.0 0.0 0.0
In [161]:
merged_df.datetime.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')[80000]
Out[161]:
Timestamp('2021-09-28 07:00:00+0300', tz='EET')
In [162]:
merged_df.datetime.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')[1500000]
Out[162]:
Timestamp('2022-12-20 07:00:00+0200', tz='EET')
In [163]:
merged_forecast_weather.forecast_datetime.dt.tz_convert('EET')[6000]
Out[163]:
Timestamp('2022-01-04 15:00:00+0200', tz='EET')
In [164]:
# localize naive datetimes to EET, handling daylight-saving transitions
merged_df['datetime_localized'] = merged_df.datetime.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')

# convert forecast timestamps from UTC to EET
merged_forecast_weather['datetime_EET']  = merged_forecast_weather.forecast_datetime.dt.tz_convert('EET')
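The distinction above is easy to get backwards: naive timestamps need `tz_localize` (attach a zone), while already-aware UTC timestamps need `tz_convert` (change the zone). A small check on made-up timestamps:

```python
import pandas as pd

# Naive local timestamps get *localized*; aware UTC timestamps get *converted*
naive = pd.Series(pd.to_datetime(['2022-12-20 07:00:00']))
local = naive.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')

utc = pd.Series(pd.to_datetime(['2022-12-20 05:00:00'], utc=True))
converted = utc.dt.tz_convert('EET')

# Both now denote the same instant (EET is UTC+2 in winter)
print(local[0] == converted[0])  # True
```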
In [165]:
merged_forecast_weather.query('data_block_id == 300')
Out[165]:
county forecast_datetime data_block_id latitude_forecast_weather longitude_forecast_weather hours_ahead_forecast_weather temperature_forecast_weather dewpoint_forecast_weather cloudcover_high_forecast_weather cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather datetime_EET
14329 0.0 2022-06-27 01:00:00+00:00 300 59.30 24.866667 1.0 17.726831 15.787215 0.000000 0.003357 0.008809 0.009867 -1.264754 1.363535 0.000000 0.000000 0.0 0.000000 2022-06-27 04:00:00+03:00
14331 0.0 2022-06-27 02:00:00+00:00 300 59.30 24.866667 2.0 17.934391 16.112492 0.000000 0.003840 0.001831 0.005351 -1.073307 1.067064 76.571599 9.877279 0.0 0.000000 2022-06-27 05:00:00+03:00
14333 0.0 2022-06-27 03:00:00+00:00 300 59.30 24.866667 3.0 19.487288 16.575301 0.000000 0.000000 0.000244 0.000244 -1.185901 1.050166 289.486844 68.890875 0.0 0.000000 2022-06-27 06:00:00+03:00
14335 0.0 2022-06-27 04:00:00+00:00 300 59.30 24.866667 4.0 21.966781 17.163924 0.000000 0.000262 0.000280 0.000539 -1.105384 0.556716 488.340764 166.954867 0.0 0.000000 2022-06-27 07:00:00+03:00
14337 0.0 2022-06-27 05:00:00+00:00 300 59.30 24.866667 5.0 24.450098 17.612248 0.000000 0.008311 0.000163 0.008382 -1.017414 0.301280 621.634103 286.280498 0.0 0.000000 2022-06-27 08:00:00+03:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
442478 15.0 2022-06-28 20:00:00+00:00 300 57.78 27.100000 44.0 22.470117 19.469775 0.528113 0.015137 0.058371 0.556158 -1.712890 0.529172 2.514667 0.056000 0.0 0.000000 2022-06-28 23:00:00+03:00
442480 15.0 2022-06-28 21:00:00+00:00 300 57.78 27.100000 45.0 21.575977 19.339404 0.370251 0.031049 0.076743 0.419739 -1.573499 0.936665 0.000000 0.000000 0.0 0.000000 2022-06-29 00:00:00+03:00
442482 15.0 2022-06-28 22:00:00+00:00 300 57.78 27.100000 46.0 20.878320 19.020068 0.509418 0.072824 0.216644 0.618463 -1.143979 0.731405 0.000000 0.000000 0.0 0.000464 2022-06-29 01:00:00+03:00
442484 15.0 2022-06-28 23:00:00+00:00 300 57.78 27.100000 47.0 20.642871 18.551855 0.942194 0.114362 0.345221 0.961719 -1.657865 -0.578932 0.000000 0.000000 0.0 0.001295 2022-06-29 02:00:00+03:00
442486 15.0 2022-06-29 00:00:00+00:00 300 57.78 27.100000 48.0 19.568115 17.740234 0.954700 0.055295 0.261322 0.957941 -2.176746 0.905358 0.000000 0.000000 0.0 0.001783 2022-06-29 03:00:00+03:00

720 rows × 19 columns

In [166]:
# merge forecast_weather
merged_df = pd.merge(merged_df, merged_forecast_weather, left_on=['data_block_id', 'datetime_localized', 'county'], right_on=['data_block_id', 'datetime_EET', 'county'], how='left')

merged_df.sample(10)
Out[166]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client ... cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather datetime_EET
667583 5 0 1 79.567 1 2022-04-06 06:00:00 217 667583 19 35.0 ... 0.149887 0.448029 0.499992 3.157317 2.655265 0.000000 0.000000 0.000181 0.000181 2022-04-06 06:00:00+03:00
1899807 15 1 3 277.765 1 2023-04-24 22:00:00 600 1899807 60 50.0 ... 0.034052 0.176721 0.201703 -0.103642 3.252309 -0.012444 0.000000 0.000000 0.000000 2023-04-24 22:00:00+03:00
1322989 5 1 1 115.854 1 2022-10-27 05:00:00 421 1322989 22 14.0 ... 0.999993 0.000008 0.999997 3.237482 -1.332670 0.000000 0.000000 0.000000 0.000026 2022-10-27 05:00:00+03:00
1665732 8 1 3 0.023 0 2023-02-09 23:00:00 526 1665732 33 34.0 ... 0.001678 1.000000 1.000006 4.558232 4.837220 0.000000 0.000000 0.000004 0.000005 2023-02-09 23:00:00+02:00
337989 4 1 0 147.829 1 2021-12-22 17:00:00 112 337989 17 6.0 ... 1.000000 0.000000 1.000000 2.387765 -2.303231 0.000000 0.000000 0.000016 0.000014 2021-12-22 17:00:00+02:00
637088 9 1 1 0.370 0 2022-03-27 19:00:00 207 637088 36 14.0 ... 0.996928 1.000005 1.000000 1.707896 2.316880 2.986667 42.040000 0.000056 0.000052 2022-03-27 19:00:00+03:00
118581 15 1 1 40.916 1 2021-10-11 03:00:00 40 118581 59 8.0 ... 0.000000 0.000601 0.628912 2.137020 2.982671 0.000000 0.000000 0.000000 0.000000 2021-10-11 03:00:00+03:00
1275251 5 1 0 285.550 1 2022-10-12 14:00:00 406 1275251 21 7.0 ... 0.101276 0.143452 0.204136 3.546041 -1.241343 694.192778 347.429722 0.000000 0.000047 2022-10-12 14:00:00+03:00
1298896 3 1 1 0.000 0 2022-10-19 20:00:00 413 1298896 13 12.0 ... 0.387380 0.422170 0.986809 2.940112 -2.279040 0.000000 0.000000 0.000000 0.000000 2022-10-19 20:00:00+03:00
1432208 7 1 3 0.319 0 2022-11-29 06:00:00 454 1432208 30 63.0 ... 1.000004 0.317584 1.000000 -3.999664 1.836443 0.000000 0.000000 0.000007 0.000009 2022-11-29 06:00:00+02:00

10 rows × 55 columns
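`left_on`/`right_on` lets the two sides keep differently named datetime keys; since both columns carry the EET zone after the cells above, equal instants line up. A toy sketch of the pattern (made-up values):

```python
import pandas as pd

left = pd.DataFrame({
    'dt_local': pd.to_datetime(['2022-06-27 04:00']).tz_localize('EET'),
    'target': [1.25],
})
right = pd.DataFrame({
    'dt_eet': pd.to_datetime(['2022-06-27 01:00'], utc=True).tz_convert('EET'),
    'temp': [17.7],
})

# left_on/right_on join differently named keys; both carry the EET zone, so
# 04:00+03:00 (summer time) lines up with 01:00 UTC converted to EET
out = pd.merge(left, right, left_on='dt_local', right_on='dt_eet', how='left')
print(out['temp'].notna().all())  # True
```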

In [167]:
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2018352 entries, 0 to 2018351
Data columns (total 55 columns):
 #   Column                                              Dtype              
---  ------                                              -----              
 0   county                                              int64              
 1   is_business                                         int64              
 2   product_type                                        int64              
 3   target                                              float64            
 4   is_consumption                                      int64              
 5   datetime                                            datetime64[ns]     
 6   data_block_id                                       int64              
 7   row_id                                              int64              
 8   prediction_unit_id                                  int64              
 9   eic_count_client                                    float64            
 10  installed_capacity_client                           float64            
 11  date_client                                         datetime64[ns]     
 12  forecast_date_gas_prices                            datetime64[ns]     
 13  lowest_price_per_mwh_gas_prices                     float64            
 14  highest_price_per_mwh_gas_prices                    float64            
 15  origin_date_gas_prices                              datetime64[ns]     
 16  time_of_day                                         object             
 17  forecast_date_electricity_prices                    datetime64[ns]     
 18  euros_per_mwh_electricity_prices                    float64            
 19  origin_date_electricity_prices                      datetime64[ns]     
 20  datetime_hist_weather                               datetime64[ns]     
 21  temperature_hist_weather                            float64            
 22  dewpoint_hist_weather                               float64            
 23  rain_hist_weather                                   float64            
 24  snowfall_hist_weather                               float64            
 25  surface_pressure_hist_weather                       float64            
 26  cloudcover_total_hist_weather                       float64            
 27  cloudcover_low_hist_weather                         float64            
 28  cloudcover_mid_hist_weather                         float64            
 29  cloudcover_high_hist_weather                        float64            
 30  windspeed_10m_hist_weather                          float64            
 31  winddirection_10m_hist_weather                      float64            
 32  shortwave_radiation_hist_weather                    float64            
 33  direct_solar_radiation_hist_weather                 float64            
 34  diffuse_radiation_hist_weather                      float64            
 35  latitude_hist_weather                               float64            
 36  longitude_hist_weather                              float64            
 37  datetime_localized                                  datetime64[ns, EET]
 38  forecast_datetime                                   datetime64[ns, UTC]
 39  latitude_forecast_weather                           float64            
 40  longitude_forecast_weather                          float64            
 41  hours_ahead_forecast_weather                        float64            
 42  temperature_forecast_weather                        float64            
 43  dewpoint_forecast_weather                           float64            
 44  cloudcover_high_forecast_weather                    float64            
 45  cloudcover_low_forecast_weather                     float64            
 46  cloudcover_mid_forecast_weather                     float64            
 47  cloudcover_total_forecast_weather                   float64            
 48  10_metre_u_wind_component_forecast_weather          float64            
 49  10_metre_v_wind_component_forecast_weather          float64            
 50  direct_solar_radiation_forecast_weather             float64            
 51  surface_solar_radiation_downwards_forecast_weather  float64            
 52  snowfall_forecast_weather                           float64            
 53  total_precipitation_forecast_weather                float64            
 54  datetime_EET                                        datetime64[ns, EET]
dtypes: datetime64[ns, EET](2), datetime64[ns, UTC](1), datetime64[ns](7), float64(37), int64(7), object(1)
memory usage: 846.9+ MB
In [168]:
merged_df.size
Out[168]:
111009360

Checking for NULL Values in Merged Data¶

In [169]:
merged_df.isnull().sum()
Out[169]:
county                                                    0
is_business                                               0
product_type                                              0
target                                                  528
is_consumption                                            0
datetime                                                  0
data_block_id                                             0
row_id                                                    0
prediction_unit_id                                        0
eic_count_client                                       8640
installed_capacity_client                              8640
date_client                                            8640
forecast_date_gas_prices                               2928
lowest_price_per_mwh_gas_prices                        2928
highest_price_per_mwh_gas_prices                       2928
origin_date_gas_prices                                 2928
time_of_day                                               0
forecast_date_electricity_prices                       3196
euros_per_mwh_electricity_prices                       3196
origin_date_electricity_prices                         3196
datetime_hist_weather                                 35064
temperature_hist_weather                              35064
dewpoint_hist_weather                                 35064
rain_hist_weather                                     35064
snowfall_hist_weather                                 35064
surface_pressure_hist_weather                         35064
cloudcover_total_hist_weather                         35064
cloudcover_low_hist_weather                           35064
cloudcover_mid_hist_weather                           35064
cloudcover_high_hist_weather                          35064
windspeed_10m_hist_weather                            35064
winddirection_10m_hist_weather                        35064
shortwave_radiation_hist_weather                      35064
direct_solar_radiation_hist_weather                   35064
diffuse_radiation_hist_weather                        35064
latitude_hist_weather                                 35064
longitude_hist_weather                                35064
datetime_localized                                        0
forecast_datetime                                     33504
latitude_forecast_weather                             33504
longitude_forecast_weather                            33504
hours_ahead_forecast_weather                          33504
temperature_forecast_weather                          33504
dewpoint_forecast_weather                             33504
cloudcover_high_forecast_weather                      33504
cloudcover_low_forecast_weather                       33504
cloudcover_mid_forecast_weather                       33504
cloudcover_total_forecast_weather                     33504
10_metre_u_wind_component_forecast_weather            33504
10_metre_v_wind_component_forecast_weather            33504
direct_solar_radiation_forecast_weather               33504
surface_solar_radiation_downwards_forecast_weather    33504
snowfall_forecast_weather                             33504
total_precipitation_forecast_weather                  33504
datetime_EET                                          33504
dtype: int64
In [170]:
merged_df.query('data_block_id != 0 and data_block_id != 1').isnull().sum()
Out[170]:
county                                                    0
is_business                                               0
product_type                                              0
target                                                  528
is_consumption                                            0
datetime                                                  0
data_block_id                                             0
row_id                                                    0
prediction_unit_id                                        0
eic_count_client                                       2784
installed_capacity_client                              2784
date_client                                            2784
forecast_date_gas_prices                                  0
lowest_price_per_mwh_gas_prices                           0
highest_price_per_mwh_gas_prices                          0
origin_date_gas_prices                                    0
time_of_day                                               0
forecast_date_electricity_prices                        268
euros_per_mwh_electricity_prices                        268
origin_date_electricity_prices                          268
datetime_hist_weather                                 30528
temperature_hist_weather                              30528
dewpoint_hist_weather                                 30528
rain_hist_weather                                     30528
snowfall_hist_weather                                 30528
surface_pressure_hist_weather                         30528
cloudcover_total_hist_weather                         30528
cloudcover_low_hist_weather                           30528
cloudcover_mid_hist_weather                           30528
cloudcover_high_hist_weather                          30528
windspeed_10m_hist_weather                            30528
winddirection_10m_hist_weather                        30528
shortwave_radiation_hist_weather                      30528
direct_solar_radiation_hist_weather                   30528
diffuse_radiation_hist_weather                        30528
latitude_hist_weather                                 30528
longitude_hist_weather                                30528
datetime_localized                                        0
forecast_datetime                                     30528
latitude_forecast_weather                             30528
longitude_forecast_weather                            30528
hours_ahead_forecast_weather                          30528
temperature_forecast_weather                          30528
dewpoint_forecast_weather                             30528
cloudcover_high_forecast_weather                      30528
cloudcover_low_forecast_weather                       30528
cloudcover_mid_forecast_weather                       30528
cloudcover_total_forecast_weather                     30528
10_metre_u_wind_component_forecast_weather            30528
10_metre_v_wind_component_forecast_weather            30528
direct_solar_radiation_forecast_weather               30528
surface_solar_radiation_downwards_forecast_weather    30528
snowfall_forecast_weather                             30528
total_precipitation_forecast_weather                  30528
datetime_EET                                          30528
dtype: int64
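Raw null counts are hard to interpret against a 2M-row frame; `isna().mean()` gives the share of nulls per column instead. A toy sketch, also showing the county filter used further below:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'target': [1.0, np.nan, 3.0, 4.0],
    'temp':   [np.nan, np.nan, 2.0, 5.0],
    'county': [0, 12, 12, 0],
})

# Null share per column is easier to compare than raw counts
null_share = df.isna().mean()
print(null_share['temp'])  # 0.5

# Restrict to a suspicious county first, mirroring the query pattern above
print(df.query('county == 12').isna().sum()['temp'])  # 1
```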
In [171]:
# note: masking with the full isnull() frame replaces every non-null cell with NaN,
# so this column comes back all-NaN by construction
merged_df.query('data_block_id != 0 and data_block_id != 1')[merged_df.query('data_block_id != 0 and data_block_id != 1').isnull()].data_block_id
Out[171]:
5856      NaN
5857      NaN
5858      NaN
5859      NaN
5860      NaN
           ..
2018347   NaN
2018348   NaN
2018349   NaN
2018350   NaN
2018351   NaN
Name: data_block_id, Length: 2012496, dtype: float64
In [172]:
#sns.scatterplot(data=merged_df, x='datetime', y= 'temperature_hist_weather', hue='county')
In [173]:
import missingno as msno
In [174]:
merged_df.head()
Out[174]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client ... cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather datetime_EET
0 0 0 1 0.713 0 2021-09-01 0 0 0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
1 0 0 1 96.590 1 2021-09-01 0 1 0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
2 0 0 2 0.000 0 2021-09-01 0 2 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
3 0 0 2 17.314 1 2021-09-01 0 3 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 0 0 3 2.904 0 2021-09-01 0 4 2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT

5 rows × 55 columns

In [175]:
msno.matrix(merged_df)
Out[175]:
<Axes: >
(plot: missing-value matrix of merged_df)
In [176]:
msno.matrix(merged_df.query('data_block_id != 0 and data_block_id != 1'))
Out[176]:
<Axes: >
(plot: missing-value matrix excluding data_block_id 0 and 1)
In [177]:
msno.bar(merged_df)
Out[177]:
<Axes: >
(plot: missing-value bar chart of merged_df)
In [178]:
merged_df[(merged_df.temperature_hist_weather.isnull()) & (merged_df.data_block_id == 200)]

# .datetime.dt.date.unique()
Out[178]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client ... cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather datetime_EET
612200 12 1 3 1.250 0 2022-03-20 00:00:00 200 612200 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612201 12 1 3 324.651 1 2022-03-20 00:00:00 200 612201 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612332 12 1 3 1.250 0 2022-03-20 01:00:00 200 612332 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612333 12 1 3 324.552 1 2022-03-20 01:00:00 200 612333 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612464 12 1 3 1.200 0 2022-03-20 02:00:00 200 612464 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612465 12 1 3 321.995 1 2022-03-20 02:00:00 200 612465 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612596 12 1 3 1.250 0 2022-03-20 03:00:00 200 612596 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612597 12 1 3 321.500 1 2022-03-20 03:00:00 200 612597 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612728 12 1 3 1.200 0 2022-03-20 04:00:00 200 612728 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612729 12 1 3 321.716 1 2022-03-20 04:00:00 200 612729 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612860 12 1 3 1.300 0 2022-03-20 05:00:00 200 612860 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612861 12 1 3 322.387 1 2022-03-20 05:00:00 200 612861 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612992 12 1 3 1.100 0 2022-03-20 06:00:00 200 612992 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
612993 12 1 3 311.407 1 2022-03-20 06:00:00 200 612993 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613124 12 1 3 1.153 0 2022-03-20 07:00:00 200 613124 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613125 12 1 3 278.844 1 2022-03-20 07:00:00 200 613125 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613256 12 1 3 18.506 0 2022-03-20 08:00:00 200 613256 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613257 12 1 3 212.506 1 2022-03-20 08:00:00 200 613257 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613388 12 1 3 57.796 0 2022-03-20 09:00:00 200 613388 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613389 12 1 3 185.578 1 2022-03-20 09:00:00 200 613389 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613520 12 1 3 97.859 0 2022-03-20 10:00:00 200 613520 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613521 12 1 3 189.307 1 2022-03-20 10:00:00 200 613521 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613652 12 1 3 113.126 0 2022-03-20 11:00:00 200 613652 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613653 12 1 3 189.486 1 2022-03-20 11:00:00 200 613653 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613784 12 1 3 115.212 0 2022-03-20 12:00:00 200 613784 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613785 12 1 3 192.305 1 2022-03-20 12:00:00 200 613785 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613916 12 1 3 99.160 0 2022-03-20 13:00:00 200 613916 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
613917 12 1 3 195.759 1 2022-03-20 13:00:00 200 613917 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614048 12 1 3 77.348 0 2022-03-20 14:00:00 200 614048 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614049 12 1 3 214.534 1 2022-03-20 14:00:00 200 614049 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614180 12 1 3 54.285 0 2022-03-20 15:00:00 200 614180 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614181 12 1 3 244.226 1 2022-03-20 15:00:00 200 614181 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614312 12 1 3 15.541 0 2022-03-20 16:00:00 200 614312 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614313 12 1 3 277.902 1 2022-03-20 16:00:00 200 614313 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614444 12 1 3 1.150 0 2022-03-20 17:00:00 200 614444 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614445 12 1 3 319.168 1 2022-03-20 17:00:00 200 614445 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614576 12 1 3 1.150 0 2022-03-20 18:00:00 200 614576 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614577 12 1 3 333.934 1 2022-03-20 18:00:00 200 614577 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614708 12 1 3 1.200 0 2022-03-20 19:00:00 200 614708 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614709 12 1 3 341.857 1 2022-03-20 19:00:00 200 614709 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614840 12 1 3 1.200 0 2022-03-20 20:00:00 200 614840 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614841 12 1 3 332.939 1 2022-03-20 20:00:00 200 614841 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614972 12 1 3 1.200 0 2022-03-20 21:00:00 200 614972 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
614973 12 1 3 334.121 1 2022-03-20 21:00:00 200 614973 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
615104 12 1 3 1.250 0 2022-03-20 22:00:00 200 615104 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
615105 12 1 3 320.538 1 2022-03-20 22:00:00 200 615105 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
615236 12 1 3 1.200 0 2022-03-20 23:00:00 200 615236 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
615237 12 1 3 324.819 1 2022-03-20 23:00:00 200 615237 49 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT

48 rows × 55 columns

In [179]:
merged_df.groupby('county').aggregate(lambda x: x.isnull().sum())
Out[179]:
is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client installed_capacity_client ... cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather datetime_EET
county
0 0 0 56 0 0 0 0 0 672 672 ... 288 288 288 288 288 288 288 288 288 288
1 0 0 24 0 0 0 0 0 288 288 ... 144 144 144 144 144 144 144 144 144 144
2 0 0 30 0 0 0 0 0 384 384 ... 96 96 96 96 96 96 96 96 96 96
3 0 0 32 0 0 0 0 0 384 384 ... 192 192 192 192 192 192 192 192 192 192
4 0 0 38 0 0 0 0 0 480 480 ... 192 192 192 192 192 192 192 192 192 192
5 0 0 40 0 0 0 0 0 576 576 ... 240 240 240 240 240 240 240 240 240 240
6 0 0 8 0 0 0 0 0 96 96 ... 48 48 48 48 48 48 48 48 48 48
7 0 0 46 0 0 0 0 0 1056 1056 ... 288 288 288 288 288 288 288 288 288 288
8 0 0 24 0 0 0 0 0 288 288 ... 144 144 144 144 144 144 144 144 144 144
9 0 0 32 0 0 0 0 0 384 384 ... 192 192 192 192 192 192 192 192 192 192
10 0 0 36 0 0 0 0 0 912 912 ... 240 240 240 240 240 240 240 240 240 240
11 0 0 52 0 0 0 0 0 1392 1392 ... 288 288 288 288 288 288 288 288 288 288
12 0 0 8 0 0 0 0 0 96 96 ... 30624 30624 30624 30624 30624 30624 30624 30624 30624 30624
13 0 0 32 0 0 0 0 0 384 384 ... 144 144 144 144 144 144 144 144 144 144
14 0 0 32 0 0 0 0 0 768 768 ... 192 192 192 192 192 192 192 192 192 192
15 0 0 38 0 0 0 0 0 480 480 ... 192 192 192 192 192 192 192 192 192 192

16 rows × 54 columns
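The lambda-based `aggregate` above recomputes `isnull()` per group and per column, which is slow on ~2M rows. An equivalent vectorized idiom (sketched on toy data) builds the boolean mask once and sums it per group:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'county': [0, 0, 12, 12],
    'temp':   [1.0, np.nan, np.nan, np.nan],
})

# Same per-county null counts as the lambda-based aggregate, but vectorized:
# build the boolean mask once, then sum it per group
null_counts = df.drop(columns='county').isna().groupby(df['county']).sum()
print(null_counts.loc[12, 'temp'])  # 2
```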

In [180]:
train.query('county == 11').shape
Out[180]:
(198000, 9)
In [181]:
# empty result confirms no weather station maps to county 12, explaining its weather NaNs
weather_station_to_county_mapping.query('county == 12')
Out[181]:
county_name longitude latitude county

EDA¶

In [182]:
import seaborn as sns
In [183]:
subsampled_df = merged_df.sample(10000)
#sns.scatterplot(data=merged_df, x='datetime', y = 'target', hue='is_consumption')
In [184]:
sns.scatterplot(data=subsampled_df, x='datetime', y = 'target', hue='is_consumption')
Out[184]:
<Axes: xlabel='datetime', ylabel='target'>
(plot: target over datetime, colored by is_consumption)
In [185]:
sns.kdeplot(data=subsampled_df, x = 'target', hue='is_consumption')
Out[185]:
<Axes: xlabel='target', ylabel='Density'>
No description has been provided for this image
In [186]:
# subset into production df

production_df = merged_df.query('is_consumption == 0').groupby('datetime').mean(numeric_only=True)

# we should aggregate target by sum, weather variables by  mean

production_df.head()
Out[186]:
county is_business product_type target is_consumption data_block_id row_id prediction_unit_id eic_count_client installed_capacity_client ... cloudcover_high_forecast_weather cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather
datetime
2021-09-01 00:00:00 7.393443 0.508197 2.0 0.071508 0.0 0.0 60.0 30.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09-01 01:00:00 7.393443 0.508197 2.0 0.194295 0.0 0.0 182.0 30.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09-01 02:00:00 7.393443 0.508197 2.0 0.030246 0.0 0.0 304.0 30.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09-01 03:00:00 7.393443 0.508197 2.0 0.024869 0.0 0.0 426.0 30.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09-01 04:00:00 7.393443 0.508197 2.0 0.040852 0.0 0.0 548.0 30.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 44 columns

In [187]:
production_df.size
Out[187]:
673728
In [188]:
sns.scatterplot(data=production_df, x = 'cloudcover_total_forecast_weather', y = 'target')
Out[188]:
<Axes: xlabel='cloudcover_total_forecast_weather', ylabel='target'>
No description has been provided for this image
In [189]:
sns.scatterplot(data=production_df, x = 'direct_solar_radiation_forecast_weather', y = 'target')
Out[189]:
<Axes: xlabel='direct_solar_radiation_forecast_weather', ylabel='target'>
No description has been provided for this image
In [190]:
sns.scatterplot(data=production_df, x = 'surface_solar_radiation_downwards_forecast_weather', y = 'target', hue='is_business')
Out[190]:
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
No description has been provided for this image
In [191]:
production_df = merged_df.query('is_consumption == 0').sample(100000)
sns.scatterplot(data=production_df, x = 'surface_solar_radiation_downwards_forecast_weather', y = 'target', hue='product_type')
Out[191]:
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
No description has been provided for this image
- Surface solar radiation seems to have a stronger correlation with target than direct_solar
- There seems to be a 'split' around 6000 (unit?) daily mean target
- We expected more businesses in the top-producers
In [192]:
client.describe()
Out[192]:
product_type county eic_count_client installed_capacity_client is_business date_client data_block_id
count 41919.000000 41919.000000 41919.000000 41919.000000 41919.000000 41919 41919.000000
mean 1.898996 7.297097 73.345118 1450.771451 0.536773 2022-07-18 21:34:22.921348096 322.898876
min 0.000000 0.000000 5.000000 5.500000 0.000000 2021-09-01 00:00:00 2.000000
25% 1.000000 3.000000 13.000000 321.900000 0.000000 2022-02-13 00:00:00 167.000000
50% 2.000000 7.000000 32.000000 645.200000 1.000000 2022-07-20 00:00:00 324.000000
75% 3.000000 11.000000 70.000000 1567.150000 1.000000 2022-12-23 00:00:00 480.000000
max 3.000000 15.000000 1517.000000 19314.310000 1.000000 2023-05-29 00:00:00 637.000000
std 1.081713 4.780750 144.064389 2422.233120 0.498652 NaN 182.075724

Focussing only on larger providers (is there a trend visible?)¶

In [193]:
production_df = merged_df.query('(is_consumption == 0) and (1200 < installed_capacity_client < 1300)')
sns.scatterplot(data=production_df, x = 'surface_solar_radiation_downwards_forecast_weather', y = 'target', hue='product_type')
Out[193]:
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
No description has been provided for this image
In [194]:
merged_df.query('is_consumption == 0').groupby('product_type')['installed_capacity_client'].aggregate(['min', 'mean', 'max'])
Out[194]:
min mean max
product_type
0 260.0 920.328878 4968.600
1 60.0 630.766002 5250.705
2 6.0 88.098622 303.000
3 165.0 2465.757342 19314.310
- product_type 2 attracts producers with small installed capacity and thus low production
- while product_type 3 attracts the opposite cluster: producers with a lot of installed_capacity

Further quick Visualizations¶

In [195]:
merged_df.plot(x='datetime', y='euros_per_mwh_electricity_prices')
plt.title('euros_per_mwh_electricity_prices')
plt.show()
No description has been provided for this image
In [196]:
plt.plot(merged_df['datetime'], merged_df['lowest_price_per_mwh_gas_prices'], label = 'lowest price')
plt.plot(merged_df['datetime'], merged_df['highest_price_per_mwh_gas_prices'], label = 'highest price')
plt.title('highest and lowest price_per_mwh_gas_prices')
plt.legend()
plt.show()
No description has been provided for this image
In [197]:
merged_df.plot(x='datetime', y='temperature_hist_weather')
plt.title('temperature_hist_weather')
plt.show()
No description has been provided for this image
In [198]:
merged_df['dewpoint_hist_weather'].plot()
plt.title('dewpoint_hist_weather')
plt.show()
No description has been provided for this image
In [199]:
merged_df['rain_hist_weather'].plot()
plt.title('rain_hist_weather')
plt.show()
No description has been provided for this image
In [200]:
merged_df['snowfall_hist_weather'].plot()
plt.title('snowfall_hist_weather')
plt.show()
No description has been provided for this image
In [201]:
merged_df['cloudcover_total_hist_weather'].plot()
plt.title('cloudcover_total_hist_weather')
plt.show()
No description has been provided for this image
In [202]:
merged_df[['shortwave_radiation_hist_weather','direct_solar_radiation_hist_weather', 'diffuse_radiation_hist_weather']].plot()
plt.title('solar radiation')
plt.show()
No description has been provided for this image
In [203]:
# Create a 1x3 grid of subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

# Plot the first graph
merged_df.groupby('county')['target'].mean().plot(kind='bar', ax=axes[0])
axes[0].set_title('Average Consumption per County')
axes[0].set_xlabel('County')
axes[0].set_ylabel('Average Consumption')

# Plot the second graph
merged_df.groupby('product_type')['target'].mean().plot(kind='bar', ax=axes[1])
axes[1].set_title('Average Consumption per Product type')
axes[1].set_xlabel('Product type')
axes[1].set_ylabel('Average Consumption')

# Plot the third graph
merged_df.groupby('is_business')['target'].mean().plot(kind='bar', ax=axes[2])
axes[2].set_title('Average Consumption per Business')
axes[2].set_xlabel('Business or not')
axes[2].set_ylabel('Average Consumption')

# Adjust layout to prevent clipping of titles
plt.tight_layout()

# Show the combined plot
plt.show()
No description has been provided for this image
In [204]:
def split_datetime(data, col="datetime"):
    # What columns are of type datetime?
    datetime_columns = data.select_dtypes(include='datetime64').columns
    
    for c in datetime_columns:
        print(f"Timezone for {c} is {data[c].dt.tz}")

    # Adding columns for date & time
    data['year']    = data[col].dt.year
    data['quarter'] = data[col].dt.quarter
    data['month']   = data[col].dt.month
    data['week']    = data[col].dt.isocalendar().week
    data['hour']    = data[col].dt.hour 

    data['day_of_year']  = data[col].dt.day_of_year
    data['day_of_month'] = data[col].dt.day
    data['day_of_week']  = data[col].dt.day_of_week

    return data

# calculate sum of production of business producers and the average over the day of the week
business_production_sum_timeseries = merged_df.query('is_consumption == 0 & is_business == 1')[["datetime", "target"]].groupby("datetime").sum()
business_production_sum_timeseries.reset_index(inplace=True)
business_production_sum_timeseries = split_datetime(business_production_sum_timeseries, "datetime")
# Have a look at the production average per day of the week
business_production_av_day_of_week = business_production_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
business_production_av_day_of_week.reset_index(inplace=True)

# calculate sum of production of non-business producers and the average over the day of the week
non_business_production_sum_timeseries = merged_df.query('is_consumption == 0 & is_business == 0')[["datetime", "target"]].groupby("datetime").sum()
non_business_production_sum_timeseries.reset_index(inplace=True)
non_business_production_sum_timeseries = split_datetime(non_business_production_sum_timeseries, "datetime")
# Have a look at the production average per day of the week
non_business_production_av_day_of_week = non_business_production_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
non_business_production_av_day_of_week.reset_index(inplace=True)

plt.figure(figsize=(12,8))
plt.plot(business_production_av_day_of_week.index, business_production_av_day_of_week["target"], label="business")
plt.plot(non_business_production_av_day_of_week.index, non_business_production_av_day_of_week["target"], label="non-business")
plt.legend()
Timezone for datetime is None
Timezone for datetime is None
Out[204]:
<matplotlib.legend.Legend at 0x332a20e50>
No description has been provided for this image
Production for businesses is higher on the weekends
In [205]:
# calculate sum of consumption of business consumers and the average over the day of the week
business_consumption_sum_timeseries = merged_df.query('is_consumption == 1 & is_business == 1')[["datetime", "target"]].groupby("datetime").sum()
business_consumption_sum_timeseries.reset_index(inplace=True)
business_consumption_sum_timeseries = split_datetime(business_consumption_sum_timeseries, "datetime")
# Have a look at the consumption average per day of the week
business_consumption_av_day_of_week = business_consumption_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
business_consumption_av_day_of_week.reset_index(inplace=True)

# calculate sum of consumption of non-business consumers and the average over the day of the week
non_business_consumption_sum_timeseries = merged_df.query('is_consumption == 1 & is_business == 0')[["datetime", "target"]].groupby("datetime").sum()
non_business_consumption_sum_timeseries.reset_index(inplace=True)
non_business_consumption_sum_timeseries = split_datetime(non_business_consumption_sum_timeseries, "datetime")
# Have a look at the consumption average per day of the week
non_business_consumption_av_day_of_week = non_business_consumption_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
non_business_consumption_av_day_of_week.reset_index(inplace=True)

plt.figure(figsize=(12,8))
plt.plot(business_consumption_av_day_of_week.index, business_consumption_av_day_of_week["target"], label="business")
plt.plot(non_business_consumption_av_day_of_week.index, non_business_consumption_av_day_of_week["target"], label="non-business")
plt.legend()
Timezone for datetime is None
Timezone for datetime is None
Out[205]:
<matplotlib.legend.Legend at 0x3e4a12950>
No description has been provided for this image
In [206]:
small_producers = merged_df.query('is_consumption == 0 & installed_capacity_client < 1000')
small_producers_sum_timeseries = small_producers[["datetime", "county", "target", "surface_solar_radiation_downwards_forecast_weather", "installed_capacity_client"]].groupby(["county", "datetime"]).agg({"target": "sum", "surface_solar_radiation_downwards_forecast_weather": "mean", "installed_capacity_client": "mean"})
small_producers_sum_timeseries.reset_index(inplace=True)

plt.figure(figsize=(12,8))
sns.scatterplot(data=small_producers_sum_timeseries.query("county == 0"), x="surface_solar_radiation_downwards_forecast_weather", y="target", hue="installed_capacity_client")

large_producers = merged_df.query('is_consumption == 0 & installed_capacity_client >= 1000')
large_producers_sum_timeseries = large_producers[["datetime", "county", "target", "surface_solar_radiation_downwards_forecast_weather", "installed_capacity_client"]].groupby(["county", "datetime"]).agg({"target": "sum", "surface_solar_radiation_downwards_forecast_weather": "mean", "installed_capacity_client": "mean"})
large_producers_sum_timeseries.reset_index(inplace=True)

plt.figure(figsize=(12,8))
sns.scatterplot(data=large_producers_sum_timeseries.query("county == 0"), x="surface_solar_radiation_downwards_forecast_weather", y="target", hue="installed_capacity_client")
Out[206]:
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
No description has been provided for this image
No description has been provided for this image
Solar radiation and production seem to be linearly related, but the slope depends on installed capacity (roughly two lines fanning out from the same origin, like a "star").
In [207]:
px.line(merged_df.groupby(['county', 'datetime', 'is_consumption']).mean(numeric_only=True).reset_index(),
         x='datetime', y='target', color='county', facet_col='is_consumption')
In [208]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# trace 0 = surface solar radiation
# trace 1 = target
# trace 2 = sum of installed capacity

#df = merged_df.query('(is_business not in [1]) and (is_consumption == 0)').groupby(['datetime']).mean(numeric_only=True).reset_index()
aggregations = {'target': 'sum', 'surface_solar_radiation_downwards_forecast_weather': 'mean', 'installed_capacity_client': 'sum'}
try:
    df = merged_df.query('(is_business not in [1]) and (is_consumption == 0)').groupby(['datetime']).aggregate(aggregations).reset_index()
except Exception:  # fall back to a plain mean if the named aggregation fails
    print('mean applied')
    df = merged_df.query('(is_business not in [1]) and (is_consumption == 0)').groupby(['datetime']).aggregate('mean').reset_index()


fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(x=df['datetime'], y=df['surface_solar_radiation_downwards_forecast_weather'], opacity=0.7),
    secondary_y=True,
)

fig.add_trace(
    go.Scatter(x=df['datetime'], y=df['target'], opacity=0.7),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df['datetime'], y=df['installed_capacity_client'], opacity=0.7),
    secondary_y=False,
)

# Disable hover information and zooming


fig
In [209]:
merged_df.head()
Out[209]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client ... cloudcover_low_forecast_weather cloudcover_mid_forecast_weather cloudcover_total_forecast_weather 10_metre_u_wind_component_forecast_weather 10_metre_v_wind_component_forecast_weather direct_solar_radiation_forecast_weather surface_solar_radiation_downwards_forecast_weather snowfall_forecast_weather total_precipitation_forecast_weather datetime_EET
0 0 0 1 0.713 0 2021-09-01 0 0 0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
1 0 0 1 96.590 1 2021-09-01 0 1 0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
2 0 0 2 0.000 0 2021-09-01 0 2 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
3 0 0 2 17.314 1 2021-09-01 0 3 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 0 0 3 2.904 0 2021-09-01 0 4 2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT

5 rows × 55 columns

Hypotheses and baseline model ideas¶

  • Simple Baseline Ideas
    • Simple baseline model: use the (average) consumption from the corresponding day one year earlier as the prediction for this year

    • How does the weather of the previous day influence production and consumption (if sunshine the day before is low, production the next day is possibly lower as well)?

    • Very simple regression model with few features (solar radiation columns, snow/temperature, capacity)

  • Correlation
    • Which parameters correlate most strongly with the target (e.g. the higher the solar radiation, the higher the production; the higher the capacity, the higher the production)?
    • Do changes in gas prices and electricity prices correlate with the target?
  • Two models: one for production (solar and capacity), one for consumption (temperature); is_consumption indicates which target a row represents
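The "very simple regression" idea above can be sketched directly. Everything below is synthetic (the feature names only mirror the real columns), so it shows the shape of the approach rather than real results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
radiation   = rng.uniform(0, 800, n)    # stand-in for surface solar radiation
capacity    = rng.uniform(5, 2000, n)   # stand-in for installed_capacity
temperature = rng.uniform(-20, 25, n)

# synthetic target: production grows with radiation and capacity
target = 0.5 * radiation + 0.1 * capacity + rng.normal(0, 10, n)

X = np.column_stack([radiation, capacity, temperature])
model = LinearRegression().fit(X, target)
print(model.coef_)  # roughly [0.5, 0.1, ~0]
```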

Feature engineering¶

Add columns for date / time¶

In [210]:
def add_features(data):
    # What columns are of type datetime?
    datetime_columns = merged_df.select_dtypes(include='datetime64').columns
    
    for c in datetime_columns:
        print(f"Timezone for {c} is {data[c].dt.tz}")

    # Adding columns for date & time
    data['year']    = data['datetime'].dt.year
    data['quarter'] = data['datetime'].dt.quarter
    data['month']   = data['datetime'].dt.month
    data['week']    = data['datetime'].dt.isocalendar().week
    data['hour']    = data['datetime'].dt.hour 

    data['day_of_year']  = data['datetime'].dt.day_of_year
    data['day_of_month'] = data['datetime'].dt.day
    data['day_of_week']  = data['datetime'].dt.day_of_week

    return data

merged_df = add_features(merged_df)

## -> need to convert to EET timezone
Timezone for datetime is None
Timezone for date_client is None
Timezone for forecast_date_gas_prices is None
Timezone for origin_date_gas_prices is None
Timezone for forecast_date_electricity_prices is None
Timezone for origin_date_electricity_prices is None
Timezone for datetime_hist_weather is None
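The `## -> need to convert to EET timezone` note above could be addressed like this, assuming the raw timestamps are UTC (an assumption worth verifying against the competition docs):

```python
import pandas as pd

# toy timestamps; assume the source is UTC (verify this assumption!)
s = pd.Series(pd.to_datetime(['2021-12-01 00:00:00', '2021-07-01 00:00:00']))

# localize as UTC, then convert to Estonian local time (EET, EEST in summer)
eet = s.dt.tz_localize('UTC').dt.tz_convert('Europe/Tallinn')
print(eet.dt.hour.tolist())  # [2, 3]: UTC+2 in winter, UTC+3 in summer
```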

Baseline model¶

model ideas

  • calculate the average of one year and use it as the prediction for the next year? (but we already know seasonality is important)
  • prediction for time t equals the target at t - 1 year (our chosen baseline)

to do

  • define x and y, only 'model' on most recent year (2023?)
In [211]:
test = train.copy()

test.head()
Out[211]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
0 0 0 1 0.713 0 2021-09-01 0 0 0
1 0 0 1 96.590 1 2021-09-01 0 1 0
2 0 0 2 0.000 0 2021-09-01 0 2 1
3 0 0 2 17.314 1 2021-09-01 0 3 1
4 0 0 3 2.904 0 2021-09-01 0 4 2
In [212]:
# create Baseline Model

test_offset = train.copy()
test_offset
test_offset['datetime'] = test_offset['datetime'] + pd.Timedelta(value=365, unit='days')  # shift forward so last year's target aligns with this year's datetime

test_offset.head()
Out[212]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id
0 0 0 1 0.713 0 2022-09-01 0 0 0
1 0 0 1 96.590 1 2022-09-01 0 1 0
2 0 0 2 0.000 0 2022-09-01 0 2 1
3 0 0 2 17.314 1 2022-09-01 0 3 1
4 0 0 3 2.904 0 2022-09-01 0 4 2
In [213]:
# merge df with baseline predictions (in test_offset)
test = test.merge(
    test_offset, 
    on=['county', 'is_business', 'product_type', 'is_consumption', 'datetime'], 
    how='left', 
    suffixes=('', '_previous_year')
)
In [214]:
# idea: define function to pull target from 1 year ago

'''def baseline(date):
    prediction  = target from one year ago   

    return prediction'''
Out[214]:
'def baseline(date):\n    prediction  = target from one year ago   \n\n    return prediction'
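The sketched function could be made runnable like this (a toy version: `history` is a stand-in for a real datetime-indexed target series, and `pd.DateOffset(years=1)` handles leap years, unlike the 365-day `Timedelta` used above):

```python
import pandas as pd

# stand-in for a real datetime-indexed target series
history = pd.Series(
    [10.0, 12.0],
    index=pd.to_datetime(['2021-09-01 00:00', '2021-09-01 01:00']),
)

def baseline(date, history=history):
    # prediction = target from exactly one year ago (NaN if that hour is missing)
    return history.get(pd.Timestamp(date) - pd.DateOffset(years=1), float('nan'))

print(baseline('2022-09-01 01:00'))  # 12.0
```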
In [215]:
from sklearn.metrics import mean_absolute_error

test.dropna(inplace=True)

print('Previous year as prediction:', mean_absolute_error(test['target'], test['target_previous_year']))
Previous year as prediction: 134.51285907725816
In [216]:
# define error fun
def calc_mae(x, y):
    mae = np.mean(np.abs(y - x))
    return mae
In [217]:
# generate df for plotting
plotdat = test.groupby(['datetime', 'is_consumption']).mean().reset_index()
In [218]:
print('MAE consumption: ', calc_mae(plotdat.query('is_consumption == 1')['target'], plotdat.query('is_consumption == 1')['target_previous_year']))
print('MAE production:  ', calc_mae(plotdat.query('is_consumption == 0')['target'], plotdat.query('is_consumption == 0')['target_previous_year']))

print('MAE all:         ', calc_mae(plotdat['target'], plotdat['target_previous_year']))
MAE consumption:  120.16727685427284
MAE production:   55.55173923700824
MAE all:          87.85950804564054

Results of the pre-implemented sklearn MAE differ from those of the defined calc_mae function because they are fed different data: the sklearn MAE above runs on every row of test, while calc_mae runs on plotdat, whose targets were first averaged per datetime (averaging before taking absolute errors can only shrink them). Kaggle uses the row-level pre-implemented function -> this is what we should optimize for.
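The gap between the two numbers comes from the data each call is given rather than from the functions: the sklearn MAE runs on every row of `test`, while `calc_mae` runs on `plotdat`, whose targets were first averaged per datetime. A toy demonstration that averaging before the absolute error can only shrink the result:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(100, 30, size=1000)
y_pred = y_true + rng.normal(0, 50, size=1000)  # noisy stand-in predictions
groups = np.repeat(np.arange(100), 10)          # 100 timestamps x 10 units each

mae_rows = np.mean(np.abs(y_true - y_pred))     # row-level MAE (what sklearn saw)

# average per timestamp first (what plotdat does), then take the MAE
true_means = np.array([y_true[groups == g].mean() for g in range(100)])
pred_means = np.array([y_pred[groups == g].mean() for g in range(100)])
mae_means = np.mean(np.abs(true_means - pred_means))

assert mae_means <= mae_rows  # triangle inequality: |mean(e)| <= mean(|e|)
```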

In [219]:
fig = make_subplots(specs=[[{"secondary_y": True}]])

test_prod = test.query('is_consumption == 0')

fig.add_trace(
    go.Scatter(x=test_prod['datetime'], y=test_prod['target'], opacity=0.7, name='production_true'),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=test_prod['datetime'], y=test_prod['target_previous_year'], opacity=0.7, name='production_pred'),
    secondary_y=False,
)
# Disable hover information and zooming

fig
In [220]:
# plot actual and baseline-predicted consumption

plt.plot(plotdat.query('is_consumption == 1')['datetime'], plotdat.query('is_consumption == 1')['target'])
plt.plot(plotdat.query('is_consumption == 1')['datetime'], plotdat.query('is_consumption == 1')['target_previous_year'], 'r')

plt.legend(['obs.', 'pred.'])
plt.title('Baseline model: observed and predicted consumption')
Out[220]:
Text(0.5, 1.0, 'Baseline model: observed and predicted consumption')
No description has been provided for this image
In [221]:
# plot actual and baseline-predicted production

plt.plot(plotdat.query('is_consumption == 0')['datetime'], plotdat.query('is_consumption == 0')['target'])
plt.plot(plotdat.query('is_consumption == 0')['datetime'], plotdat.query('is_consumption == 0')['target_previous_year'], 'r')

plt.legend(['obs.', 'pred.'])
plt.title('Baseline model: observed and predicted production')
Out[221]:
Text(0.5, 1.0, 'Baseline model: observed and predicted production')
No description has been provided for this image
In [222]:
# plot actual and baseline-predicted production
# average over data block id, because during night (no sun), production is 0 and therefore plot becomes unreadable

plotdat2 = plotdat.groupby(['data_block_id', 'is_consumption']).mean().reset_index()

plt.plot(plotdat2.query('is_consumption == 0')['datetime'], plotdat2.query('is_consumption == 0')['target'])
plt.plot(plotdat2.query('is_consumption == 0')['datetime'], plotdat2.query('is_consumption == 0')['target_previous_year'], 'r')

plt.legend(['obs.', 'pred.'])
plt.title('Baseline model: observed and predicted production')
Out[222]:
Text(0.5, 1.0, 'Baseline model: observed and predicted production')
No description has been provided for this image

Modelling¶

In [223]:
del train
del client
del gas_prices
del electricity_prices
del forecast_weather
del historical_weather
In [224]:
# copy df for modelling
model_df = merged_df.copy()

# model is not able to handle object type
model_df.drop('time_of_day', axis=1, inplace=True)

# split datetime into meaningful features of int types
model_df = split_datetime(model_df)

# model is not able to handle datetime
model_df = model_df.drop(model_df.select_dtypes(include=['datetime64[ns]', 'datetime64[ns, EET]']).columns, axis=1)

# drop na from target
model_df.dropna(subset=['target'], inplace=True)
Timezone for datetime is None
Timezone for date_client is None
Timezone for forecast_date_gas_prices is None
Timezone for origin_date_gas_prices is None
Timezone for forecast_date_electricity_prices is None
Timezone for origin_date_electricity_prices is None
Timezone for datetime_hist_weather is None
In [225]:
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
In [226]:
X_train, X_test, y_train,  y_test = train_test_split(model_df.drop('target', axis=1), model_df['target'], test_size=0.3, random_state=0)

bst = XGBRegressor(enable_categorical=True)
bst.fit(X_train, y_train)
y_pred = bst.predict(X_test)

# main optimisation metric
print('Mean absolute error test', mean_absolute_error(y_test, y_pred))
print('Mean absolute error train', mean_absolute_error(y_train, bst.predict(X_train)))
Mean absolute error test 50.735112752114915
Mean absolute error train 49.32964331672551
In [227]:
# first attempt gave us ~50.74 mean absolute error on the test split
In [228]:
model_df['data_block_id'].describe()
Out[228]:
count    2.017824e+06
mean     3.218754e+02
std      1.826310e+02
min      0.000000e+00
25%      1.660000e+02
50%      3.230000e+02
75%      4.790000e+02
max      6.370000e+02
Name: data_block_id, dtype: float64
In [229]:
# split of old data to train and newer one to test

Xy_train = model_df[model_df.data_block_id < 450]
X_train = Xy_train.drop('target', axis=1)
y_train = Xy_train.target

Xy_test = model_df[model_df.data_block_id >= 450]
X_test = Xy_test.drop('target', axis=1)
y_test = Xy_test.target

bst = XGBRegressor(enable_categorical=True)
bst.fit(X_train, y_train)
y_pred = bst.predict(X_test)

# main optimisation metric
print('Mean absolute error test', mean_absolute_error(y_test, y_pred))
print('Mean absolute error train', mean_absolute_error(y_train, bst.predict(X_train)))
Mean absolute error test 109.3723796305108
Mean absolute error train 44.359425148907334
In [230]:
# divide by dates and use newer ones for validation
In [231]:
bst.feature_names_in_
Out[231]:
array(['county', 'is_business', 'product_type', 'is_consumption',
       'data_block_id', 'row_id', 'prediction_unit_id',
       'eic_count_client', 'installed_capacity_client',
       'lowest_price_per_mwh_gas_prices',
       'highest_price_per_mwh_gas_prices',
       'euros_per_mwh_electricity_prices', 'temperature_hist_weather',
       'dewpoint_hist_weather', 'rain_hist_weather',
       'snowfall_hist_weather', 'surface_pressure_hist_weather',
       'cloudcover_total_hist_weather', 'cloudcover_low_hist_weather',
       'cloudcover_mid_hist_weather', 'cloudcover_high_hist_weather',
       'windspeed_10m_hist_weather', 'winddirection_10m_hist_weather',
       'shortwave_radiation_hist_weather',
       'direct_solar_radiation_hist_weather',
       'diffuse_radiation_hist_weather', 'latitude_hist_weather',
       'longitude_hist_weather', 'latitude_forecast_weather',
       'longitude_forecast_weather', 'hours_ahead_forecast_weather',
       'temperature_forecast_weather', 'dewpoint_forecast_weather',
       'cloudcover_high_forecast_weather',
       'cloudcover_low_forecast_weather',
       'cloudcover_mid_forecast_weather',
       'cloudcover_total_forecast_weather',
       '10_metre_u_wind_component_forecast_weather',
       '10_metre_v_wind_component_forecast_weather',
       'direct_solar_radiation_forecast_weather',
       'surface_solar_radiation_downwards_forecast_weather',
       'snowfall_forecast_weather',
       'total_precipitation_forecast_weather', 'year', 'quarter', 'month',
       'week', 'hour', 'day_of_year', 'day_of_month', 'day_of_week'],
      dtype='<U50')
In [232]:
bst.feature_importances_
Out[232]:
array([0.02317161, 0.19814272, 0.0029233 , 0.16164385, 0.00755489,
       0.01130688, 0.0255705 , 0.01065247, 0.20469157, 0.00137255,
       0.00233338, 0.0032099 , 0.00840222, 0.00161291, 0.00115936,
       0.00040205, 0.00126914, 0.00026398, 0.00053658, 0.00101611,
       0.00041738, 0.00092538, 0.00074864, 0.00248047, 0.00115231,
       0.00131035, 0.01265862, 0.00617154, 0.04587046, 0.00263224,
       0.01358437, 0.02500847, 0.00102796, 0.0005635 , 0.01480848,
       0.00196344, 0.00156044, 0.00141749, 0.00141629, 0.00660226,
       0.06700346, 0.03550554, 0.00196549, 0.        , 0.00048924,
       0.00194408, 0.0076613 , 0.02466489, 0.01165206, 0.00241098,
       0.0371469 ], dtype=float32)
In [233]:
features_dict = {key: value for key, value in zip(bst.feature_names_in_, bst.feature_importances_)}
features_dict
Out[233]:
{'county': 0.02317161,
 'is_business': 0.19814272,
 'product_type': 0.002923302,
 'is_consumption': 0.16164385,
 'data_block_id': 0.0075548883,
 'row_id': 0.011306881,
 'prediction_unit_id': 0.0255705,
 'eic_count_client': 0.0106524695,
 'installed_capacity_client': 0.20469157,
 'lowest_price_per_mwh_gas_prices': 0.0013725488,
 'highest_price_per_mwh_gas_prices': 0.0023333782,
 'euros_per_mwh_electricity_prices': 0.0032099008,
 'temperature_hist_weather': 0.008402225,
 'dewpoint_hist_weather': 0.0016129076,
 'rain_hist_weather': 0.0011593648,
 'snowfall_hist_weather': 0.00040205359,
 'surface_pressure_hist_weather': 0.0012691438,
 'cloudcover_total_hist_weather': 0.0002639827,
 'cloudcover_low_hist_weather': 0.0005365753,
 'cloudcover_mid_hist_weather': 0.0010161146,
 'cloudcover_high_hist_weather': 0.00041738356,
 'windspeed_10m_hist_weather': 0.0009253759,
 'winddirection_10m_hist_weather': 0.0007486421,
 'shortwave_radiation_hist_weather': 0.0024804687,
 'direct_solar_radiation_hist_weather': 0.001152306,
 'diffuse_radiation_hist_weather': 0.0013103458,
 'latitude_hist_weather': 0.012658624,
 'longitude_hist_weather': 0.00617154,
 'latitude_forecast_weather': 0.04587046,
 'longitude_forecast_weather': 0.0026322408,
 'hours_ahead_forecast_weather': 0.013584371,
 'temperature_forecast_weather': 0.025008474,
 'dewpoint_forecast_weather': 0.0010279635,
 'cloudcover_high_forecast_weather': 0.000563501,
 'cloudcover_low_forecast_weather': 0.014808475,
 'cloudcover_mid_forecast_weather': 0.0019634373,
 'cloudcover_total_forecast_weather': 0.0015604437,
 '10_metre_u_wind_component_forecast_weather': 0.0014174874,
 '10_metre_v_wind_component_forecast_weather': 0.001416292,
 'direct_solar_radiation_forecast_weather': 0.0066022635,
 'surface_solar_radiation_downwards_forecast_weather': 0.06700346,
 'snowfall_forecast_weather': 0.035505537,
 'total_precipitation_forecast_weather': 0.0019654892,
 'year': 0.0,
 'quarter': 0.0004892406,
 'month': 0.0019440779,
 'week': 0.0076612984,
 'hour': 0.02466489,
 'day_of_year': 0.011652055,
 'day_of_month': 0.002410979,
 'day_of_week': 0.037146896}
In [234]:
xgb.plot_importance(bst)
plt.title('Feature Importance')
plt.show()
No description has been provided for this image
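One way to act on the importances above is to keep only features above a threshold and refit; a minimal sketch using a few of the printed values (the 0.01 cutoff is arbitrary):

```python
# a few of the importances printed above, rounded
features_dict = {
    'installed_capacity_client': 0.2047,
    'is_business': 0.1981,
    'is_consumption': 0.1616,
    'surface_solar_radiation_downwards_forecast_weather': 0.0670,
    'cloudcover_total_hist_weather': 0.0003,
    'year': 0.0,
}

threshold = 0.01  # arbitrary cutoff
keep = [name for name, imp in sorted(features_dict.items(), key=lambda kv: -kv[1])
        if imp > threshold]
print(keep[:2])  # ['installed_capacity_client', 'is_business']
```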
In [235]:
# hours_ahead_forecast is treated as an important feature, probably something to drop
  • visualisation
  • split by date
  • tweaking the parameters
  • drop some features
  • feature engineering
  • overfitting with traditional train_test_split?
  • try two models / multiple output / other models
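On the train_test_split question in the list above: a random split lets the model train on hours that come after the test hours, which inflates the score (this matches the jump from ~50 MAE to ~109 once we split by data_block_id). sklearn's TimeSeriesSplit gives expanding-window folds instead; a sketch on a dummy, time-ordered index:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# pretend each row is one hour, already ordered by datetime (as model_df is)
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # every training index precedes every test index: no peeking into the future
    assert train_idx.max() < test_idx.min()
    print(f"train up to {train_idx.max()}, test {test_idx.min()}..{test_idx.max()}")
```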
In [238]:
# model_df2 = model_df.copy()
# model_df2.drop(['row_id', ])

split_datablock = 300

Xy_train = model_df[model_df.data_block_id < split_datablock]
X_train = Xy_train.drop('target', axis=1)
y_train = Xy_train.target

Xy_test = model_df[model_df.data_block_id >= split_datablock]
X_test = Xy_test.drop('target', axis=1)
y_test = Xy_test.target

bst = XGBRegressor(enable_categorical=True)
bst.fit(X_train, y_train)

y_pred_test = bst.predict(X_test)
y_pred_train = bst.predict(X_train)

# main optimisation metric
print('Mean absolute error test', mean_absolute_error(y_test, y_pred_test))
print('Mean absolute error train', mean_absolute_error(y_train, y_pred_train))
Mean absolute error test 143.24422275958875
Mean absolute error train 39.44093814710219
In [ ]:
#px.line(x=Xy_train.index, y=y_pred_train-y_train, color=Xy_train.month)
In [ ]:
#px.line(data_frame=Xy_test, x=Xy_test.index, y=y_pred_test-y_test, color=Xy_test.month, hover_data='day_of_week')

Plot y_train vs. y_pred¶

In [240]:
trace1 = go.Scatter(x=Xy_train.index, y=y_train, mode='lines', name='y_train', fill='none')
trace2 = go.Scatter(x=Xy_train.index, y=y_pred_train, mode='lines', name='y_pred_train', fill='none')

layout = go.Layout(title='Comparison of y_train and y_pred_train', xaxis=dict(title='Index'), yaxis=dict(title='Values'))

fig = go.Figure(data=[trace1, trace2], layout=layout)

# Show the plot
fig.show()

Target consumption vs. production¶

In [243]:
merged_cons = merged_df[merged_df['is_consumption'] == 1].copy()
merged_prod = merged_df[merged_df['is_consumption'] == 0].copy()

plt.plot(merged_cons.index, merged_cons['target'], label='target consumption', linestyle='-', marker='o', color='red')
plt.plot(merged_prod.index, merged_prod['target'], label='target production', linestyle='-', marker='o', color='green')

# Set labels and title
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Comparison of target consumption vs. production')

# Add legend
plt.legend()

# Show the plot
plt.show()
In [244]:
# Create a line plot for consumption
plt.figure(figsize=(12, 6))  # Set the figure size
plt.subplot(1, 2, 1)  # Create the first subplot
for month in merged_cons['month'].unique():
    data = merged_cons[merged_cons['month'] == month]
    plt.plot(data.index, data['target'], label=f'Month {month}')
plt.xlabel('Index')
plt.ylabel('Target')
plt.title('Target Consumption by Month')
plt.legend()

# Create a line plot for production
plt.subplot(1, 2, 2)  # Create the second subplot
for month in merged_prod['month'].unique():
    data = merged_prod[merged_prod['month'] == month]
    plt.plot(data.index, data['target'], label=f'Month {month}')
plt.xlabel('Index')
plt.ylabel('Target')
plt.title('Target Production by Month')
plt.legend()

# Adjust layout for better spacing
plt.tight_layout()

# Show the plots
plt.show()
In [246]:
merged_cons
Out[246]:
county is_business product_type target is_consumption datetime data_block_id row_id prediction_unit_id eic_count_client ... total_precipitation_forecast_weather datetime_EET year quarter month week hour day_of_year day_of_month day_of_week
1 0 0 1 96.590 1 2021-09-01 00:00:00 0 1 0 NaN ... NaN NaT 2021 3 9 35 0 244 1 2
3 0 0 2 17.314 1 2021-09-01 00:00:00 0 3 1 NaN ... NaN NaT 2021 3 9 35 0 244 1 2
5 0 0 3 656.859 1 2021-09-01 00:00:00 0 5 2 NaN ... NaN NaT 2021 3 9 35 0 244 1 2
7 0 1 0 59.000 1 2021-09-01 00:00:00 0 7 3 NaN ... NaN NaT 2021 3 9 35 0 244 1 2
9 0 1 1 501.760 1 2021-09-01 00:00:00 0 9 4 NaN ... NaN NaT 2021 3 9 35 0 244 1 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018343 15 0 1 42.401 1 2023-05-31 23:00:00 637 2018343 57 51.0 ... 0.0 2023-05-31 23:00:00+03:00 2023 2 5 22 23 151 31 2
2018345 15 0 3 117.332 1 2023-05-31 23:00:00 637 2018345 58 161.0 ... 0.0 2023-05-31 23:00:00+03:00 2023 2 5 22 23 151 31 2
2018347 15 1 0 197.233 1 2023-05-31 23:00:00 637 2018347 64 15.0 ... 0.0 2023-05-31 23:00:00+03:00 2023 2 5 22 23 151 31 2
2018349 15 1 1 28.404 1 2023-05-31 23:00:00 637 2018349 59 20.0 ... 0.0 2023-05-31 23:00:00+03:00 2023 2 5 22 23 151 31 2
2018351 15 1 3 196.240 1 2023-05-31 23:00:00 637 2018351 60 55.0 ... 0.0 2023-05-31 23:00:00+03:00 2023 2 5 22 23 151 31 2

1009176 rows × 63 columns

In [252]:
# Set shared y-axis limits based on the combined consumption and production data
y_min = min(merged_cons['target'].min(), merged_prod['target'].min())
y_max = max(merged_cons['target'].max(), merged_prod['target'].max())

# Create a line plot for consumption
plt.figure(figsize=(10, 5))  # Set the figure size
plt.plot(merged_cons['week'], merged_cons['target'], label='Consumption', linestyle='-', marker='o')
plt.xlabel('Week')
plt.ylabel('Target')
plt.title('Target Consumption by Week')
plt.ylim(y_min, y_max)  # Set y-axis limits
plt.legend()

# Create a line plot for production
plt.figure(figsize=(10, 5))  # Set the figure size
plt.plot(merged_prod['week'], merged_prod['target'], label='Production', linestyle='-', marker='o')
plt.xlabel('Week')
plt.ylabel('Target')
plt.title('Target Production by Week')
plt.ylim(y_min, y_max)  # Set y-axis limits
plt.legend()

# Show the plots
plt.show()
In [ ]:
# Compute residuals (prediction error) on the test set and analyse them
Xy_test = Xy_test.assign(residual=y_pred_test - y_test)
target_column = 'residual'

# Exclude non-numeric columns
numeric_columns = Xy_test.select_dtypes(include=['number']).columns
numeric_df = Xy_test[numeric_columns]
numeric_df_cons = numeric_df[numeric_df['is_consumption'] == 1]

# Calculate the correlation matrix
correlation_matrix = numeric_df_cons.corr()

# Select correlations based on the threshold
threshold = 0.15
significant_correlations = correlation_matrix[(correlation_matrix[target_column] > threshold) | (correlation_matrix[target_column] < -threshold)][target_column]

# Plot a heatmap of the significant correlations with the target
plt.figure(figsize=(12, 8))
sns.heatmap(significant_correlations.to_frame(), annot=True, cmap='coolwarm', fmt=".2f", cbar=False)
plt.title(f'Significant Correlations with {target_column}, CONSUM ONLY (Threshold: {threshold})')
plt.show()
In [ ]:
target_column = 'residual'

# Exclude non-numeric columns
numeric_columns = Xy_test.select_dtypes(include=['number']).columns
numeric_df = Xy_test[numeric_columns]
numeric_df_prod = numeric_df[numeric_df['is_consumption'] == 0]

# Calculate the correlation matrix
correlation_matrix = numeric_df_prod.corr()

# Select correlations based on the threshold
threshold = 0.15
significant_correlations = correlation_matrix[(correlation_matrix[target_column] > threshold) | (correlation_matrix[target_column] < -threshold)][target_column]

# Plot a heatmap of the significant correlations with the target
plt.figure(figsize=(12, 8))
sns.heatmap(significant_correlations.to_frame(), annot=True, cmap='coolwarm', fmt=".2f", cbar=False)
plt.title(f'Significant Correlations with {target_column}, PRODUCTION ONLY (Threshold: {threshold})')
plt.show()
In [ ]:
 
  • residuals are larger in summer; we guess this is because most production happens then

  • residuals on the test data show a weekly pattern

  • the last two months are predicted very poorly

  • residuals differ depending on how we split our data

  • we see unexplainable patterns in the residuals

  • residuals for consumption and production correlate with different features

  • try residual analysis with a traditional train_test_split

  • tweak the model
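The "traditional train_test_split" residual analysis suggested above could be sketched as follows. All data and column names here are synthetic stand-ins for `model_df`, and `GradientBoostingRegressor` stands in for `XGBRegressor` so the sketch is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the merged feature table
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'temperature': rng.normal(10, 5, 1000),
    'installed_capacity': rng.uniform(100, 8000, 1000),
    'hour': rng.integers(0, 24, 1000),
})
df['target'] = 0.05 * df['installed_capacity'] + rng.normal(0, 50, 1000)

X = df.drop('target', axis=1)
y = df['target']

# Random (non-chronological) split, i.e. the 'traditional' approach
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Residuals and their correlation with each feature
residuals = y_test - model.predict(X_test)
resid_corr = X_test.assign(residual=residuals).corr()['residual'].drop('residual')
print(resid_corr.round(3))
```

Comparing the feature/residual correlations from this random split against those from the chronological split above would show how much the observed residual patterns depend on the splitting strategy.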